A significant number of hotel bookings are called off through cancellations or no-shows. Typical reasons for cancellation include a change of plans or scheduling conflicts. Cancelling is often made easier by the option to do so free of charge or at a low cost, which benefits hotel guests but is less desirable, and possibly revenue-diminishing, for hotels. Such losses are particularly high for last-minute cancellations.
The new technologies involving online booking channels have dramatically changed customers’ booking possibilities and behavior. This adds a further dimension to the challenge of how hotels handle cancellations, which are no longer limited to traditional booking and guest characteristics.
The cancellation of bookings impacts a hotel on various fronts:
The increasing number of cancellations calls for a machine-learning-based solution that can help predict which bookings are likely to be canceled. INN Hotels Group, which has a chain of hotels in Portugal, is facing a high number of booking cancellations and has reached out to your firm for data-driven solutions. As a data scientist, you have to analyze the data provided to find which factors have a high influence on booking cancellations, build a predictive model that can predict in advance which bookings will be canceled, and help formulate profitable policies for cancellations and refunds.
The data contains the different attributes of customers' booking details. The detailed data dictionary is given below.
Data Dictionary
# Libraries to help with reading and manipulating data
import pandas as pd
import numpy as np
# Libraries to help with data visualization
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
# Library to split data
from sklearn.model_selection import train_test_split
# To build the decision tree model
from sklearn.tree import DecisionTreeClassifier
from sklearn import tree
# To build linear model for statistical analysis and prediction
import statsmodels.stats.api as sms
from statsmodels.stats.outliers_influence import variance_inflation_factor
import statsmodels.api as sm
from statsmodels.tools.tools import add_constant
# To get different metric scores
from sklearn import metrics
from sklearn.metrics import accuracy_score, roc_curve, confusion_matrix, roc_auc_score
# To filter the warnings
import warnings
warnings.filterwarnings("ignore")
df = pd.read_csv("INNHotelsGroup.csv")
df.shape
(36275, 19)
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 36275 entries, 0 to 36274
Data columns (total 19 columns):
 #   Column                                Non-Null Count  Dtype
---  ------                                --------------  -----
 0   Booking_ID                            36275 non-null  object
 1   no_of_adults                          36275 non-null  int64
 2   no_of_children                        36275 non-null  int64
 3   no_of_weekend_nights                  36275 non-null  int64
 4   no_of_week_nights                     36275 non-null  int64
 5   type_of_meal_plan                     36275 non-null  object
 6   required_car_parking_space            36275 non-null  int64
 7   room_type_reserved                    36275 non-null  object
 8   lead_time                             36275 non-null  int64
 9   arrival_year                          36275 non-null  int64
 10  arrival_month                         36275 non-null  int64
 11  arrival_date                          36275 non-null  int64
 12  market_segment_type                   36275 non-null  object
 13  repeated_guest                        36275 non-null  int64
 14  no_of_previous_cancellations          36275 non-null  int64
 15  no_of_previous_bookings_not_canceled  36275 non-null  int64
 16  avg_price_per_room                    36275 non-null  float64
 17  no_of_special_requests                36275 non-null  int64
 18  booking_status                        36275 non-null  object
dtypes: float64(1), int64(13), object(5)
memory usage: 5.3+ MB
df.isnull().sum()
Booking_ID                              0
no_of_adults                            0
no_of_children                          0
no_of_weekend_nights                    0
no_of_week_nights                       0
type_of_meal_plan                       0
required_car_parking_space              0
room_type_reserved                      0
lead_time                               0
arrival_year                            0
arrival_month                           0
arrival_date                            0
market_segment_type                     0
repeated_guest                          0
no_of_previous_cancellations            0
no_of_previous_bookings_not_canceled    0
avg_price_per_room                      0
no_of_special_requests                  0
booking_status                          0
dtype: int64
df.arrival_date.unique()
array([ 2, 6, 28, 20, 11, 13, 15, 26, 18, 30, 5, 10, 4, 25, 22, 21, 19,
17, 7, 9, 27, 1, 29, 16, 3, 24, 14, 31, 23, 8, 12],
dtype=int64)
df.nunique()
Booking_ID                              36275
no_of_adults                                5
no_of_children                              6
no_of_weekend_nights                        8
no_of_week_nights                          18
type_of_meal_plan                           4
required_car_parking_space                  2
room_type_reserved                          7
lead_time                                 352
arrival_year                                2
arrival_month                              12
arrival_date                               31
market_segment_type                         5
repeated_guest                              2
no_of_previous_cancellations                9
no_of_previous_bookings_not_canceled       59
avg_price_per_room                       3930
no_of_special_requests                      6
booking_status                              2
dtype: int64
Booking_ID is a unique identifier: all 36,275 values are distinct, so it carries no information that can be grouped or modeled and can be excluded from this analysis. The other object columns are type of meal plan, room type reserved, market segment type, and booking status (our dependent variable). Each has a small, fixed set of categories, so these object columns will be converted to the category datatype below.
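The object-to-category conversion described above can be tried on a small frame first. This is a minimal sketch on made-up rows, not the INN Hotels data; the `toy` frame and its values are assumptions for illustration.

```python
import pandas as pd

# Toy rows (made up for illustration), one string column and one numeric column.
toy = pd.DataFrame(
    {
        "market_segment_type": ["Online", "Offline", "Online", "Corporate"],
        "lead_time": [10, 45, 3, 7],
    }
)

# Convert every non-numeric column to the category dtype.
for col in toy.columns:
    if not pd.api.types.is_numeric_dtype(toy[col]):
        toy[col] = toy[col].astype("category")

# Each category level is stored once; rows hold small integer codes.
levels = list(toy["market_segment_type"].cat.categories)
codes = toy["market_segment_type"].cat.codes.tolist()
print(levels, codes)
```

The column still displays its string labels; only the storage changes, which is why the memory usage reported by `info()` drops after the conversion.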
data = df.copy()
del data['Booking_ID']
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 36275 entries, 0 to 36274
Data columns (total 18 columns):
 #   Column                                Non-Null Count  Dtype
---  ------                                --------------  -----
 0   no_of_adults                          36275 non-null  int64
 1   no_of_children                        36275 non-null  int64
 2   no_of_weekend_nights                  36275 non-null  int64
 3   no_of_week_nights                     36275 non-null  int64
 4   type_of_meal_plan                     36275 non-null  object
 5   required_car_parking_space            36275 non-null  int64
 6   room_type_reserved                    36275 non-null  object
 7   lead_time                             36275 non-null  int64
 8   arrival_year                          36275 non-null  int64
 9   arrival_month                         36275 non-null  int64
 10  arrival_date                          36275 non-null  int64
 11  market_segment_type                   36275 non-null  object
 12  repeated_guest                        36275 non-null  int64
 13  no_of_previous_cancellations          36275 non-null  int64
 14  no_of_previous_bookings_not_canceled  36275 non-null  int64
 15  avg_price_per_room                    36275 non-null  float64
 16  no_of_special_requests                36275 non-null  int64
 17  booking_status                        36275 non-null  object
dtypes: float64(1), int64(13), object(4)
memory usage: 5.0+ MB
data[data.duplicated()].count()
no_of_adults                            10275
no_of_children                          10275
no_of_weekend_nights                    10275
no_of_week_nights                       10275
type_of_meal_plan                       10275
required_car_parking_space              10275
room_type_reserved                      10275
lead_time                               10275
arrival_year                            10275
arrival_month                           10275
arrival_date                            10275
market_segment_type                     10275
repeated_guest                          10275
no_of_previous_cancellations            10275
no_of_previous_bookings_not_canceled    10275
avg_price_per_room                      10275
no_of_special_requests                  10275
booking_status                          10275
dtype: int64
data.drop_duplicates(inplace=True)
df.drop_duplicates(inplace=True)
Dropped the 10,275 duplicate rows from the dataset, leaving 26,000 rows.
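The duplicates only become visible once the unique Booking_ID column is removed, because an identifier makes every row distinct. A minimal sketch on made-up rows (the IDs and values below are assumptions, not records from the dataset):

```python
import pandas as pd

# Three toy bookings; the first two are identical except for their ID.
bookings = pd.DataFrame(
    {
        "Booking_ID": ["A1", "A2", "A3"],
        "lead_time": [10, 10, 45],
        "avg_price_per_room": [95.0, 95.0, 120.0],
    }
)

# With the ID present, no row repeats another row.
dupes_with_id = int(bookings.duplicated().sum())

# Without the ID, the second row is flagged as a repeat of the first.
features = bookings.drop(columns="Booking_ID")
dupes_without_id = int(features.duplicated().sum())

# drop_duplicates keeps the first occurrence of each repeated row.
deduplicated = features.drop_duplicates()
print(dupes_with_id, dupes_without_id, len(deduplicated))
```

Whether such rows are true duplicates or separate bookings that happen to share every attribute is a judgment call; this notebook treats them as duplicates.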
data.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 26000 entries, 0 to 36273
Data columns (total 18 columns):
 #   Column                                Non-Null Count  Dtype
---  ------                                --------------  -----
 0   no_of_adults                          26000 non-null  int64
 1   no_of_children                        26000 non-null  int64
 2   no_of_weekend_nights                  26000 non-null  int64
 3   no_of_week_nights                     26000 non-null  int64
 4   type_of_meal_plan                     26000 non-null  object
 5   required_car_parking_space            26000 non-null  int64
 6   room_type_reserved                    26000 non-null  object
 7   lead_time                             26000 non-null  int64
 8   arrival_year                          26000 non-null  int64
 9   arrival_month                         26000 non-null  int64
 10  arrival_date                          26000 non-null  int64
 11  market_segment_type                   26000 non-null  object
 12  repeated_guest                        26000 non-null  int64
 13  no_of_previous_cancellations          26000 non-null  int64
 14  no_of_previous_bookings_not_canceled  26000 non-null  int64
 15  avg_price_per_room                    26000 non-null  float64
 16  no_of_special_requests                26000 non-null  int64
 17  booking_status                        26000 non-null  object
dtypes: float64(1), int64(13), object(4)
memory usage: 3.8+ MB
for feature in data.columns:  # Loop through all columns in the dataframe
    if data[feature].dtype == 'object':  # Only apply to columns holding categorical strings
        data[feature] = pd.Categorical(data[feature])  # Convert strings to the category dtype
data.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 26000 entries, 0 to 36273
Data columns (total 18 columns):
 #   Column                                Non-Null Count  Dtype
---  ------                                --------------  -----
 0   no_of_adults                          26000 non-null  int64
 1   no_of_children                        26000 non-null  int64
 2   no_of_weekend_nights                  26000 non-null  int64
 3   no_of_week_nights                     26000 non-null  int64
 4   type_of_meal_plan                     26000 non-null  category
 5   required_car_parking_space            26000 non-null  int64
 6   room_type_reserved                    26000 non-null  category
 7   lead_time                             26000 non-null  int64
 8   arrival_year                          26000 non-null  int64
 9   arrival_month                         26000 non-null  int64
 10  arrival_date                          26000 non-null  int64
 11  market_segment_type                   26000 non-null  category
 12  repeated_guest                        26000 non-null  int64
 13  no_of_previous_cancellations          26000 non-null  int64
 14  no_of_previous_bookings_not_canceled  26000 non-null  int64
 15  avg_price_per_room                    26000 non-null  float64
 16  no_of_special_requests                26000 non-null  int64
 17  booking_status                        26000 non-null  category
dtypes: category(4), float64(1), int64(13)
memory usage: 3.1 MB
data["booking_status"].value_counts(1)
Not_Canceled    0.713769
Canceled        0.286231
Name: booking_status, dtype: float64
About 71% of the bookings in the data set were not canceled, while 29% were canceled. This ratio will be maintained for modeling.
#data.sort_values(by=['arrival_date', 'arrival_month'], ascending=True)
test = data.loc[(data['arrival_month'] == 2) & (data['arrival_date'] == 29)]
test
| | no_of_adults | no_of_children | no_of_weekend_nights | no_of_week_nights | type_of_meal_plan | required_car_parking_space | room_type_reserved | lead_time | arrival_year | arrival_month | arrival_date | market_segment_type | repeated_guest | no_of_previous_cancellations | no_of_previous_bookings_not_canceled | avg_price_per_room | no_of_special_requests | booking_status |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2626 | 2 | 0 | 1 | 5 | Meal Plan 1 | 0 | Room_Type 1 | 104 | 2018 | 2 | 29 | Online | 1 | 1 | 0 | 61.43 | 0 | Canceled |
| 3677 | 1 | 0 | 1 | 3 | Meal Plan 1 | 0 | Room_Type 1 | 21 | 2018 | 2 | 29 | Online | 0 | 0 | 0 | 102.05 | 0 | Canceled |
| 5600 | 2 | 0 | 1 | 3 | Meal Plan 1 | 0 | Room_Type 1 | 24 | 2018 | 2 | 29 | Offline | 0 | 0 | 0 | 45.50 | 0 | Not_Canceled |
| 6343 | 1 | 0 | 1 | 1 | Meal Plan 1 | 0 | Room_Type 1 | 117 | 2018 | 2 | 29 | Offline | 0 | 0 | 0 | 76.00 | 0 | Not_Canceled |
| 7648 | 2 | 1 | 1 | 5 | Meal Plan 1 | 0 | Room_Type 1 | 35 | 2018 | 2 | 29 | Online | 0 | 0 | 0 | 98.10 | 1 | Canceled |
| 8000 | 2 | 2 | 1 | 3 | Meal Plan 1 | 0 | Room_Type 6 | 3 | 2018 | 2 | 29 | Online | 0 | 0 | 0 | 183.00 | 1 | Not_Canceled |
| 8989 | 1 | 0 | 1 | 2 | Meal Plan 1 | 0 | Room_Type 1 | 117 | 2018 | 2 | 29 | Offline | 0 | 0 | 0 | 76.00 | 0 | Not_Canceled |
| 9153 | 2 | 2 | 1 | 3 | Meal Plan 1 | 0 | Room_Type 6 | 3 | 2018 | 2 | 29 | Online | 0 | 0 | 0 | 189.75 | 0 | Not_Canceled |
| 9245 | 2 | 0 | 1 | 3 | Meal Plan 1 | 0 | Room_Type 4 | 15 | 2018 | 2 | 29 | Online | 0 | 0 | 0 | 85.55 | 1 | Not_Canceled |
| 9664 | 1 | 0 | 1 | 0 | Meal Plan 1 | 0 | Room_Type 4 | 21 | 2018 | 2 | 29 | Online | 0 | 0 | 0 | 117.00 | 0 | Not_Canceled |
| 9934 | 1 | 0 | 1 | 2 | Meal Plan 1 | 0 | Room_Type 1 | 45 | 2018 | 2 | 29 | Online | 0 | 0 | 0 | 76.30 | 0 | Not_Canceled |
| 10593 | 2 | 0 | 1 | 3 | Meal Plan 1 | 1 | Room_Type 4 | 47 | 2018 | 2 | 29 | Online | 0 | 0 | 0 | 99.40 | 1 | Not_Canceled |
| 10652 | 2 | 0 | 1 | 2 | Meal Plan 1 | 0 | Room_Type 1 | 117 | 2018 | 2 | 29 | Offline | 0 | 0 | 0 | 86.33 | 0 | Not_Canceled |
| 10747 | 2 | 0 | 1 | 3 | Meal Plan 1 | 0 | Room_Type 1 | 88 | 2018 | 2 | 29 | Online | 0 | 0 | 0 | 56.94 | 0 | Canceled |
| 11881 | 1 | 0 | 3 | 7 | Meal Plan 1 | 0 | Room_Type 1 | 58 | 2018 | 2 | 29 | Online | 0 | 0 | 0 | 66.45 | 1 | Not_Canceled |
| 13958 | 1 | 0 | 1 | 1 | Meal Plan 1 | 0 | Room_Type 1 | 0 | 2018 | 2 | 29 | Complementary | 0 | 0 | 0 | 3.00 | 0 | Not_Canceled |
| 14304 | 2 | 0 | 1 | 3 | Meal Plan 2 | 0 | Room_Type 1 | 13 | 2018 | 2 | 29 | Online | 0 | 0 | 0 | 114.55 | 0 | Not_Canceled |
| 15363 | 1 | 0 | 1 | 2 | Meal Plan 1 | 0 | Room_Type 1 | 61 | 2018 | 2 | 29 | Online | 0 | 0 | 0 | 78.90 | 1 | Canceled |
| 15438 | 1 | 0 | 1 | 2 | Meal Plan 1 | 0 | Room_Type 1 | 1 | 2018 | 2 | 29 | Offline | 0 | 0 | 0 | 76.00 | 0 | Not_Canceled |
| 17202 | 2 | 0 | 1 | 3 | Meal Plan 2 | 0 | Room_Type 1 | 13 | 2018 | 2 | 29 | Online | 0 | 0 | 0 | 107.80 | 0 | Not_Canceled |
| 18534 | 1 | 0 | 1 | 2 | Meal Plan 1 | 0 | Room_Type 5 | 3 | 2018 | 2 | 29 | Corporate | 0 | 0 | 0 | 107.00 | 0 | Not_Canceled |
| 18680 | 1 | 0 | 1 | 4 | Meal Plan 1 | 1 | Room_Type 1 | 4 | 2018 | 2 | 29 | Corporate | 1 | 0 | 11 | 68.00 | 1 | Not_Canceled |
| 19013 | 1 | 0 | 1 | 1 | Meal Plan 1 | 1 | Room_Type 1 | 7 | 2018 | 2 | 29 | Corporate | 0 | 0 | 0 | 68.00 | 0 | Not_Canceled |
| 20419 | 2 | 0 | 1 | 1 | Meal Plan 1 | 0 | Room_Type 4 | 33 | 2018 | 2 | 29 | Online | 0 | 0 | 0 | 106.40 | 0 | Not_Canceled |
| 21674 | 1 | 0 | 1 | 1 | Meal Plan 1 | 0 | Room_Type 1 | 117 | 2018 | 2 | 29 | Offline | 0 | 0 | 0 | 75.00 | 0 | Not_Canceled |
| 21688 | 2 | 0 | 1 | 2 | Meal Plan 1 | 0 | Room_Type 4 | 57 | 2018 | 2 | 29 | Online | 0 | 0 | 0 | 95.30 | 1 | Not_Canceled |
| 26108 | 2 | 0 | 1 | 2 | Meal Plan 1 | 0 | Room_Type 1 | 39 | 2018 | 2 | 29 | Online | 0 | 0 | 0 | 60.10 | 0 | Not_Canceled |
| 27559 | 1 | 0 | 1 | 0 | Meal Plan 1 | 0 | Room_Type 1 | 0 | 2018 | 2 | 29 | Corporate | 1 | 0 | 10 | 65.00 | 1 | Not_Canceled |
| 27928 | 2 | 0 | 1 | 5 | Meal Plan 1 | 0 | Room_Type 4 | 115 | 2018 | 2 | 29 | Online | 0 | 0 | 0 | 102.33 | 1 | Not_Canceled |
| 30552 | 2 | 1 | 1 | 3 | Meal Plan 1 | 0 | Room_Type 4 | 13 | 2018 | 2 | 29 | Offline | 0 | 0 | 0 | 86.80 | 0 | Canceled |
| 30616 | 1 | 0 | 1 | 0 | Meal Plan 1 | 0 | Room_Type 5 | 21 | 2018 | 2 | 29 | Offline | 0 | 0 | 0 | 142.00 | 0 | Not_Canceled |
| 30632 | 3 | 0 | 1 | 2 | Meal Plan 2 | 0 | Room_Type 4 | 7 | 2018 | 2 | 29 | Online | 0 | 0 | 0 | 193.00 | 2 | Not_Canceled |
| 32041 | 2 | 0 | 1 | 0 | Not Selected | 0 | Room_Type 1 | 50 | 2018 | 2 | 29 | Online | 0 | 0 | 0 | 76.50 | 0 | Canceled |
| 34638 | 1 | 0 | 1 | 2 | Meal Plan 1 | 0 | Room_Type 1 | 3 | 2018 | 2 | 29 | Corporate | 1 | 0 | 1 | 66.00 | 0 | Not_Canceled |
| 35481 | 1 | 0 | 1 | 1 | Meal Plan 1 | 0 | Room_Type 1 | 7 | 2018 | 2 | 29 | Corporate | 0 | 0 | 0 | 66.00 | 0 | Not_Canceled |
It seems 35 rows were entered with the incorrect arrival date of Feb 29, 2018. Since 2018 is not a leap year, this date is not valid, so these 35 rows will be deleted from the dataset.
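The leap-year reasoning above can be checked with the standard library alone. This is a minimal sketch; `is_valid_date` is a hypothetical helper written for illustration and is not part of the notebook.

```python
from datetime import date


def is_valid_date(year, month, day):
    """Return True if year-month-day is a real calendar date."""
    try:
        date(year, month, day)  # raises ValueError for impossible dates
    except ValueError:
        return False
    return True


# 2018 is not a leap year, so Feb 29 does not exist; 2020 is a leap year.
print(is_valid_date(2018, 2, 29))
print(is_valid_date(2018, 2, 28))
print(is_valid_date(2020, 2, 29))
```

A check like this catches any impossible day and month combination, not only Feb 29, so it could also be used to confirm that no other invalid dates remain.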
test = data[(data['arrival_month'] == 2) & (data['arrival_date'] == 29)].index
data.drop(test, inplace=True)
test = data.loc[(data['arrival_month'] == 2) & (data['arrival_date'] == 29)]
test
| | no_of_adults | no_of_children | no_of_weekend_nights | no_of_week_nights | type_of_meal_plan | required_car_parking_space | room_type_reserved | lead_time | arrival_year | arrival_month | arrival_date | market_segment_type | repeated_guest | no_of_previous_cancellations | no_of_previous_bookings_not_canceled | avg_price_per_room | no_of_special_requests | booking_status |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
data.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 25965 entries, 0 to 36273
Data columns (total 18 columns):
 #   Column                                Non-Null Count  Dtype
---  ------                                --------------  -----
 0   no_of_adults                          25965 non-null  int64
 1   no_of_children                        25965 non-null  int64
 2   no_of_weekend_nights                  25965 non-null  int64
 3   no_of_week_nights                     25965 non-null  int64
 4   type_of_meal_plan                     25965 non-null  category
 5   required_car_parking_space            25965 non-null  int64
 6   room_type_reserved                    25965 non-null  category
 7   lead_time                             25965 non-null  int64
 8   arrival_year                          25965 non-null  int64
 9   arrival_month                         25965 non-null  int64
 10  arrival_date                          25965 non-null  int64
 11  market_segment_type                   25965 non-null  category
 12  repeated_guest                        25965 non-null  int64
 13  no_of_previous_cancellations          25965 non-null  int64
 14  no_of_previous_bookings_not_canceled  25965 non-null  int64
 15  avg_price_per_room                    25965 non-null  float64
 16  no_of_special_requests                25965 non-null  int64
 17  booking_status                        25965 non-null  category
dtypes: category(4), float64(1), int64(13)
memory usage: 3.1 MB
The 35 rows containing impossible dates were deleted.
pd.to_datetime(dict(year=data.arrival_year, month=data.arrival_month, day=data.arrival_date))
data.rename(columns = {'arrival_year':'year', 'arrival_month':'month', 'arrival_date':'day'}, inplace = True)
cols=["year","month","day"]
data['arrival_year_date'] = data[cols].apply(lambda x: '-'.join(x.values.astype(str)), axis="columns")
#pd.to_datetime(dict(year=updatedate.year, month=updatedate.month, day=updatedate.day))
data['arrival_year_date'] = data['arrival_year_date'].astype('datetime64[ns]')
data.drop(['year', 'month', 'day'], axis = 1, inplace=True)
data.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 25965 entries, 0 to 36273
Data columns (total 16 columns):
 #   Column                                Non-Null Count  Dtype
---  ------                                --------------  -----
 0   no_of_adults                          25965 non-null  int64
 1   no_of_children                        25965 non-null  int64
 2   no_of_weekend_nights                  25965 non-null  int64
 3   no_of_week_nights                     25965 non-null  int64
 4   type_of_meal_plan                     25965 non-null  category
 5   required_car_parking_space            25965 non-null  int64
 6   room_type_reserved                    25965 non-null  category
 7   lead_time                             25965 non-null  int64
 8   market_segment_type                   25965 non-null  category
 9   repeated_guest                        25965 non-null  int64
 10  no_of_previous_cancellations          25965 non-null  int64
 11  no_of_previous_bookings_not_canceled  25965 non-null  int64
 12  avg_price_per_room                    25965 non-null  float64
 13  no_of_special_requests                25965 non-null  int64
 14  booking_status                        25965 non-null  category
 15  arrival_year_date                     25965 non-null  datetime64[ns]
dtypes: category(4), datetime64[ns](1), float64(1), int64(10)
memory usage: 2.7 MB
Combined the year, month, and date columns into a single arrival_year_date column with a datetime datatype.
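As an alternative to joining strings, `pd.to_datetime` can assemble a date directly from a frame whose columns are named `year`, `month`, and `day`. The sketch below uses three made-up rows and is one possible route, not the method used in this notebook; with `errors="coerce"`, an impossible date becomes `NaT` instead of raising.

```python
import pandas as pd

# Toy arrival parts (made up); the last row is the impossible Feb 29, 2018.
parts = pd.DataFrame(
    {
        "arrival_year": [2017, 2018, 2018],
        "arrival_month": [10, 2, 2],
        "arrival_date": [2, 28, 29],
    }
)

# to_datetime expects the column names year / month / day.
renamed = parts.rename(
    columns={"arrival_year": "year", "arrival_month": "month", "arrival_date": "day"}
)

# Invalid combinations are turned into NaT rather than raising an error.
arrival = pd.to_datetime(renamed[["year", "month", "day"]], errors="coerce")
print(arrival)
```

Rows that come back as `NaT` are exactly the ones with impossible dates, so this route doubles as a validity check.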
data.describe().T
| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| no_of_adults | 25965.0 | 1.890468 | 0.528516 | 0.0 | 2.00 | 2.0 | 2.0 | 4.0 |
| no_of_children | 25965.0 | 0.141190 | 0.462439 | 0.0 | 0.00 | 0.0 | 0.0 | 10.0 |
| no_of_weekend_nights | 25965.0 | 0.882149 | 0.887862 | 0.0 | 0.00 | 1.0 | 2.0 | 7.0 |
| no_of_week_nights | 25965.0 | 2.261814 | 1.512290 | 0.0 | 1.00 | 2.0 | 3.0 | 17.0 |
| required_car_parking_space | 25965.0 | 0.042057 | 0.200722 | 0.0 | 0.00 | 0.0 | 0.0 | 1.0 |
| lead_time | 25965.0 | 66.489313 | 68.630134 | 0.0 | 12.00 | 44.0 | 100.0 | 443.0 |
| repeated_guest | 25965.0 | 0.032659 | 0.177747 | 0.0 | 0.00 | 0.0 | 0.0 | 1.0 |
| no_of_previous_cancellations | 25965.0 | 0.028538 | 0.409121 | 0.0 | 0.00 | 0.0 | 0.0 | 13.0 |
| no_of_previous_bookings_not_canceled | 25965.0 | 0.212555 | 2.067643 | 0.0 | 0.00 | 0.0 | 0.0 | 58.0 |
| avg_price_per_room | 25965.0 | 105.715938 | 37.871936 | 0.0 | 80.75 | 100.0 | 127.0 | 540.0 |
| no_of_special_requests | 25965.0 | 0.742500 | 0.815293 | 0.0 | 0.00 | 1.0 | 1.0 | 5.0 |
Leading Questions:
# Univariate Analysis
def histogram_boxplot(data, feature, figsize=(12, 7), kde=False, bins=None):
    """
    Boxplot and histogram combined
    data: dataframe
    feature: dataframe column
    figsize: size of figure (default (12,7))
    kde: whether to show the density curve (default False)
    bins: number of bins for histogram (default None)
    """
    f2, (ax_box2, ax_hist2) = plt.subplots(
        nrows=2,  # Number of rows of the subplot grid = 2
        sharex=True,  # x-axis will be shared among all subplots
        gridspec_kw={"height_ratios": (0.25, 0.75)},
        figsize=figsize,
    )  # creating the 2 subplots
    sns.boxplot(
        data=data, x=feature, ax=ax_box2, showmeans=True, color="violet"
    )  # boxplot will be created and a marker will indicate the mean value of the column
    sns.histplot(
        data=data, x=feature, kde=kde, ax=ax_hist2, bins=bins, palette="winter"
    ) if bins else sns.histplot(
        data=data, x=feature, kde=kde, ax=ax_hist2
    )  # For histogram
    ax_hist2.axvline(
        data[feature].mean(), color="green", linestyle="--"
    )  # Add mean to the histogram
    ax_hist2.axvline(
        data[feature].median(), color="black", linestyle="-"
    )  # Add median to the histogram
# function to create labeled barplots
def labeled_barplot(data, feature, perc=False, n=None):
    """
    Barplot with percentage at the top
    data: dataframe
    feature: dataframe column
    perc: whether to display percentages instead of count (default is False)
    n: displays the top n category levels (default is None, i.e., display all levels)
    """
    total = len(data[feature])  # length of the column
    count = data[feature].nunique()
    if n is None:
        plt.figure(figsize=(count + 2, 6))
    else:
        plt.figure(figsize=(n + 2, 6))
    plt.xticks(rotation=90, fontsize=15)
    ax = sns.countplot(
        data=data,
        x=feature,
        palette="Paired",
        order=data[feature].value_counts().index[:n].sort_values(),
    )
    for p in ax.patches:
        if perc:
            label = "{:.1f}%".format(
                100 * p.get_height() / total
            )  # percentage of each class of the category
        else:
            label = p.get_height()  # count of each level of the category
        x = p.get_x() + p.get_width() / 2  # horizontal center of the bar
        y = p.get_height()  # height of the bar
        ax.annotate(
            label,
            (x, y),
            ha="center",
            va="center",
            size=12,
            xytext=(0, 5),
            textcoords="offset points",
        )  # annotate the bar with its count or percentage
    plt.show()  # show the plot
def stacked_barplot(data, predictor, target):
    """
    Print the category counts and plot a stacked bar chart
    data: dataframe
    predictor: independent variable
    target: target variable
    """
    count = data[predictor].nunique()
    sorter = data[target].value_counts().index[-1]  # least frequent target class
    tab1 = pd.crosstab(data[predictor], data[target], margins=True).sort_values(
        by=sorter, ascending=False
    )
    print(tab1)
    print("-" * 120)
    tab = pd.crosstab(data[predictor], data[target], normalize="index").sort_values(
        by=sorter, ascending=False
    )
    tab.plot(kind="bar", stacked=True, figsize=(count + 5, 5))
    # place the legend outside the axes, at the top right
    plt.legend(loc="upper left", bbox_to_anchor=(1, 1))
    plt.show()
### function to plot distributions wrt target
def distribution_plot_wrt_target(data, predictor, target):
    fig, axs = plt.subplots(2, 2, figsize=(12, 10))
    target_uniq = data[target].unique()
    axs[0, 0].set_title("Distribution of target for target=" + str(target_uniq[0]))
    sns.histplot(
        data=data[data[target] == target_uniq[0]],
        x=predictor,
        kde=True,
        ax=axs[0, 0],
        color="teal",
        stat="density",
    )
    axs[0, 1].set_title("Distribution of target for target=" + str(target_uniq[1]))
    sns.histplot(
        data=data[data[target] == target_uniq[1]],
        x=predictor,
        kde=True,
        ax=axs[0, 1],
        color="orange",
        stat="density",
    )
    axs[1, 0].set_title("Boxplot w.r.t target")
    sns.boxplot(data=data, x=target, y=predictor, ax=axs[1, 0], palette="gist_rainbow")
    axs[1, 1].set_title("Boxplot (without outliers) w.r.t target")
    sns.boxplot(
        data=data,
        x=target,
        y=predictor,
        ax=axs[1, 1],
        showfliers=False,
        palette="gist_rainbow",
    )
    plt.tight_layout()
    plt.show()
histogram_boxplot(data,'no_of_adults')
histogram_boxplot(data,'no_of_children')
histogram_boxplot(data,'no_of_weekend_nights')
histogram_boxplot(data,'no_of_week_nights')
histogram_boxplot(data,'lead_time')
histogram_boxplot(data,'no_of_previous_cancellations')
histogram_boxplot(data,'no_of_previous_bookings_not_canceled')
histogram_boxplot(data,'avg_price_per_room')
histogram_boxplot(data,'no_of_special_requests')
labeled_barplot(data, "repeated_guest", perc=True)
Only 3.3% of all guests are repeated guests.
labeled_barplot(data, "required_car_parking_space", perc=True)
Only 4.2% of guests required parking.
labeled_barplot(data, "type_of_meal_plan", perc=True)
labeled_barplot(data, "room_type_reserved", perc=True)
labeled_barplot(data, "market_segment_type", perc=True)
77% of all bookings were made online.
labeled_barplot(data, "booking_status", perc=True)
sns.set(rc={'figure.figsize': (30, 20)}, font_scale=4)  # set figure size and font scale before plotting so they apply to this figure
ax = sns.boxplot(x="market_segment_type", y="lead_time", hue="booking_status", data=data)
ax.set_xticklabels(ax.get_xticklabels(), rotation=90)
ax.set_xlabel("Market Segment Type", fontsize=30)
ax.set_ylabel("Lead Time (days)", fontsize=30)
plt.show()
We can see that offline and online reservations have, on average, a longer lead time, which allows more time to book another guest into the same room after a cancellation. On average, aviation has the shortest lead time for both canceled and not-canceled reservations.
sns.set(font_scale = 5)
ax = sns.boxplot(x="market_segment_type", y="avg_price_per_room", hue = "booking_status", data=data)
ax.set_xticklabels(ax.get_xticklabels(),rotation = 90)
ax.set_xlabel("Market Segment Type", fontsize = 30)
ax.set_ylabel("avg_price_per_room", fontsize = 30)
sns.set(rc = {'figure.figsize':(30,20)})
plt.show()
It appears that, on average, canceled bookings had a higher room price than bookings that were not canceled across the market segments.
sns.set(font_scale = 5)
ax = sns.boxplot(x="market_segment_type", y="no_of_weekend_nights", hue = "booking_status", data=data)
ax.set_xticklabels(ax.get_xticklabels(),rotation = 90)
ax.set_xlabel("Market Segment Type", fontsize = 30)
ax.set_ylabel("no_of_weekend_nights", fontsize = 30)
sns.set(rc = {'figure.figsize':(30,20)})
plt.show()
sns.set(font_scale = 5)
ax = sns.boxplot(x="market_segment_type", y="no_of_week_nights", hue = "booking_status", data=data)
ax.set_xticklabels(ax.get_xticklabels(),rotation = 90)
ax.set_xlabel("Market Segment Type", fontsize = 30)
ax.set_ylabel("no_of_week_nights", fontsize = 30)
sns.set(rc = {'figure.figsize':(30,20)})
plt.show()
sns.set(font_scale = 5)
ax = sns.boxplot(x="no_of_children", y="avg_price_per_room", hue = "booking_status", data=data)
ax.set_xticklabels(ax.get_xticklabels(),rotation = 90)
ax.set_xlabel("no_of_children", fontsize = 30)
ax.set_ylabel("avg_price_per_room", fontsize = 30)
sns.set(rc = {'figure.figsize':(30,20)})
plt.show()
sns.set(font_scale = 5)
ax = sns.boxplot(x="no_of_adults", y="avg_price_per_room", hue = "booking_status", data=data)
ax.set_xticklabels(ax.get_xticklabels(),rotation = 90)
ax.set_xlabel("no_of_adults", fontsize = 30)
ax.set_ylabel("avg_price_per_room", fontsize = 30)
sns.set(rc = {'figure.figsize':(30,20)})
plt.show()
Both the number of adults and the number of children are generally associated with a higher room price. Bookings with 3 or more children show a decrease in price, while the room price rises continuously with each additional adult.
stacked_barplot(data, "type_of_meal_plan", "booking_status")
booking_status     Canceled  Not_Canceled    All
type_of_meal_plan
All                    7435         18530  25965
Meal Plan 1            5611         14756  20367
Not Selected           1399          3063   4462
Meal Plan 2             424           707   1131
Meal Plan 3               1             4      5
------------------------------------------------------------------------------------------------------------------------
stacked_barplot(data, "market_segment_type", "booking_status")
booking_status       Canceled  Not_Canceled    All
market_segment_type
All                      7435         18530  25965
Online                   6795         13204  19999
Offline                   487          3617   4104
Corporate                 130          1276   1406
Aviation                   23            77    100
Complementary               0           356    356
------------------------------------------------------------------------------------------------------------------------
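The stacked bars show each segment's share of cancellations; the same rates can be worked out by hand from the counts printed above. A minimal sketch in plain Python, with the counts copied from that table (the `counts` and `rates` names are just for illustration):

```python
# (canceled, total) bookings per market segment, copied from the printed crosstab
counts = {
    "Online": (6795, 19999),
    "Offline": (487, 4104),
    "Corporate": (130, 1406),
    "Aviation": (23, 100),
    "Complementary": (0, 356),
}

# cancellation rate = canceled bookings / all bookings in the segment
rates = {segment: canceled / total for segment, (canceled, total) in counts.items()}

# print segments from highest to lowest cancellation rate
for segment, rate in sorted(rates.items(), key=lambda item: item[1], reverse=True):
    print(f"{segment:<14}{rate:6.1%}")
```

Online bookings cancel at roughly 34%, well above the overall rate of about 29%, while complementary bookings in this data were never canceled.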
stacked_barplot(data, "room_type_reserved", "booking_status")
booking_status      Canceled  Not_Canceled    All
room_type_reserved
All                     7435         18530  25965
Room_Type 1             4942         13675  18617
Room_Type 4             1806          3609   5415
Room_Type 6              393           548    941
Room_Type 2              199           401    600
Room_Type 5               59           171    230
Room_Type 7               34           122    156
Room_Type 3                2             4      6
------------------------------------------------------------------------------------------------------------------------
plt.figure(figsize=(15, 7))
sns.heatmap(data.select_dtypes(include=np.number).corr(), annot=True, vmin=-1, vmax=1, cmap="Spectral")  # correlate numeric columns only
plt.show()
It appears there may be a slight correlation between the number of previous bookings not canceled and the repeated guest column. I will wait for confirmation from modeling before deciding which column, if either, to remove.
sns.set(font_scale = 5)
ax = sns.boxplot(x="repeated_guest", y="no_of_previous_bookings_not_canceled", hue = "booking_status", data=data)
ax.set_xticklabels(ax.get_xticklabels(),rotation = 90)
ax.set_xlabel("repeated_guest", fontsize = 30)
ax.set_ylabel("no_of_previous_bookings_not_canceled", fontsize = 30)
sns.set(rc = {'figure.figsize':(30,20)})
plt.show()
sns.pairplot(data=data, hue="booking_status" )
#hue="booking_status"
plt.show()
sns.set_style("darkgrid")
data.hist(figsize=(15, 10))
plt.show()
distribution_plot_wrt_target(data, "avg_price_per_room", "booking_status")
distribution_plot_wrt_target(data, "lead_time", "booking_status")
data["booking_status"].value_counts(1)
Not_Canceled    0.713653
Canceled        0.286347
Name: booking_status, dtype: float64
From the data above, approximately 71% of the rooms booked were not canceled and 29% were canceled. We will maintain this ratio as we split the data for modeling.
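Maintaining the class ratio means splitting within each class rather than across the whole frame. The sketch below does this with pandas only, on a made-up 70/30 frame; the usual route in scikit-learn is the `stratify` argument of `train_test_split`, and the names `toy`, `test_set`, and `train_set` are assumptions for illustration.

```python
import pandas as pd

# Toy data (made up): 70 bookings kept, 30 canceled.
toy = pd.DataFrame(
    {
        "booking_status": ["Not_Canceled"] * 70 + ["Canceled"] * 30,
        "lead_time": range(100),
    }
)

# Sample 30% of the rows from each class separately, so both parts keep the 70/30 ratio.
test_set = toy.groupby("booking_status", group_keys=False).sample(frac=0.3, random_state=1)
train_set = toy.drop(test_set.index)

test_share = (test_set["booking_status"] == "Canceled").mean()
train_share = (train_set["booking_status"] == "Canceled").mean()
print(len(train_set), len(test_set), train_share, test_share)
```

A plain random split would only match the ratio approximately; sampling per class keeps it as close as the row counts allow.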
data.isnull().values.any()
False
No missing values in the dataset.
data.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 25965 entries, 0 to 36273
Data columns (total 16 columns):
 #   Column                                Non-Null Count  Dtype
---  ------                                --------------  -----
 0   no_of_adults                          25965 non-null  int64
 1   no_of_children                        25965 non-null  int64
 2   no_of_weekend_nights                  25965 non-null  int64
 3   no_of_week_nights                     25965 non-null  int64
 4   type_of_meal_plan                     25965 non-null  category
 5   required_car_parking_space            25965 non-null  int64
 6   room_type_reserved                    25965 non-null  category
 7   lead_time                             25965 non-null  int64
 8   market_segment_type                   25965 non-null  category
 9   repeated_guest                        25965 non-null  int64
 10  no_of_previous_cancellations          25965 non-null  int64
 11  no_of_previous_bookings_not_canceled  25965 non-null  int64
 12  avg_price_per_room                    25965 non-null  float64
 13  no_of_special_requests                25965 non-null  int64
 14  booking_status                        25965 non-null  category
 15  arrival_year_date                     25965 non-null  datetime64[ns]
dtypes: category(4), datetime64[ns](1), float64(1), int64(10)
memory usage: 3.7 MB
plt.figure(figsize=(15, 7))
sns.heatmap(pdata.corr(), annot=True, vmin=-1, vmax=1, cmap="Spectral")
plt.show()
There are a few high correlations that will be closely observed once we begin to trim the model. Number of children and Room Type 6 have a high correlation of 0.65. Repeated guest also correlates with the Corporate market segment (0.51). There are several strongly negative correlations: Meal Plan 1 and not selecting a meal plan (-0.87), Room Type 1 and Room Type 4 (-0.82), and the Offline and Online market segments (-0.79).
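The correlations above are between individual category levels, which suggests the heatmap was drawn on a one-hot encoded copy of the data. How `pdata` was built is not shown in these cells, so the sketch below is an assumption: it uses `pd.get_dummies` on a made-up frame to show why levels of the same variable come out negatively correlated.

```python
import pandas as pd

# Toy rows (made up for illustration).
toy = pd.DataFrame(
    {
        "type_of_meal_plan": [
            "Meal Plan 1", "Not Selected", "Meal Plan 1",
            "Meal Plan 2", "Meal Plan 1", "Not Selected",
        ],
        "lead_time": [5, 40, 12, 80, 3, 60],
    }
)

# One 0/1 column per category level (assumed encoding, not confirmed by the notebook).
encoded = pd.get_dummies(toy, columns=["type_of_meal_plan"], dtype=float)
corr = encoded.corr()

# A booking can have only one meal plan, so when one dummy is 1 the others are 0.
r = corr.loc["type_of_meal_plan_Meal Plan 1", "type_of_meal_plan_Not Selected"]
print(round(r, 2))
```

Because the levels of one variable are mutually exclusive, strong negative correlations between its two most common levels are expected and do not by themselves indicate a problem; dropping one level per variable is a common way to handle this.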
data["date"] = data["arrival_year_date"].values.astype(float)
data.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 25965 entries, 0 to 36273
Data columns (total 17 columns):
 #   Column                                Non-Null Count  Dtype
---  ------                                --------------  -----
 0   no_of_adults                          25965 non-null  int64
 1   no_of_children                        25965 non-null  int64
 2   no_of_weekend_nights                  25965 non-null  int64
 3   no_of_week_nights                     25965 non-null  int64
 4   type_of_meal_plan                     25965 non-null  category
 5   required_car_parking_space            25965 non-null  int64
 6   room_type_reserved                    25965 non-null  category
 7   lead_time                             25965 non-null  int64
 8   market_segment_type                   25965 non-null  category
 9   repeated_guest                        25965 non-null  int64
 10  no_of_previous_cancellations          25965 non-null  int64
 11  no_of_previous_bookings_not_canceled  25965 non-null  int64
 12  avg_price_per_room                    25965 non-null  float64
 13  no_of_special_requests                25965 non-null  int64
 14  booking_status                        25965 non-null  category
 15  arrival_year_date                     25965 non-null  datetime64[ns]
 16  date                                  25965 non-null  float64
dtypes: category(4), datetime64[ns](1), float64(2), int64(10)
memory usage: 3.9 MB
ldata = data.copy()
numerical_col = ldata.select_dtypes(include=np.number).columns.tolist()
#plt.figure(figsize=(20, 30))
#sns.set(font_scale = 4)
#ax = sns.boxplot(x="market_segment_type", y="lead_time", hue = "booking_status", data=data)
#ax.set_xticklabels(ax.get_xticklabels(),rotation = 90)
#ax.set_xlabel("Market Segment Type", fontsize = 30)
#ax.set_ylabel("Lead Time (days)", fontsize = 30)
#sns.set(rc = {'figure.figsize':(30,20)})
plt.show()
# set the figure size and font sizes before drawing, so the subplots land on this figure
plt.figure(figsize=(10, 10), dpi=80)
plt.rc('axes', titlesize=10)
plt.rc('axes', labelsize=10)
for i, variable in enumerate(numerical_col):
    plt.subplot(5, 4, i + 1)
    plt.boxplot(data[variable], whis=1.5)
    plt.title(variable)
plt.tight_layout()
plt.show()
# functions to treat outliers by flooring and capping
def treat_outliers(df, col):
    """
    Treats outliers in a variable
    df: dataframe
    col: dataframe column
    """
    Q1 = df[col].quantile(0.25)  # 25th quantile
    Q3 = df[col].quantile(0.75)  # 75th quantile
    IQR = Q3 - Q1
    Lower_Whisker = Q1 - 1.5 * IQR
    Upper_Whisker = Q3 + 1.5 * IQR
    # all the values smaller than Lower_Whisker will be assigned the value of Lower_Whisker
    # all the values greater than Upper_Whisker will be assigned the value of Upper_Whisker
    df[col] = np.clip(df[col], Lower_Whisker, Upper_Whisker)
    return df
def treat_outliers_all(df, col_list):
    """
    Treat outliers in a list of variables
    df: dataframe
    col_list: list of dataframe columns
    """
    for c in col_list:
        df = treat_outliers(df, c)
    return df
cols_list = ("no_of_week_nights", 'no_of_weekend_nights', 'lead_time', 'avg_price_per_room')
ldata = treat_outliers_all(ldata, cols_list)
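To make the flooring and capping rule concrete, here is a small worked example on a made-up column (the numbers are illustrative, not taken from the booking data). It applies the same Q1 - 1.5 * IQR and Q3 + 1.5 * IQR bounds as treat_outliers, but on a copy, since the function above modifies the dataframe it is given.

```python
import pandas as pd

toy = pd.DataFrame({"lead_time": [1, 2, 3, 4, 100]})  # 100 is an obvious outlier

q1 = toy["lead_time"].quantile(0.25)
q3 = toy["lead_time"].quantile(0.75)
iqr = q3 - q1
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# Series.clip pulls values outside [lower, upper] back to the nearest bound
capped = toy.copy()
capped["lead_time"] = capped["lead_time"].clip(lower=lower, upper=upper)

print(lower, upper)
print(capped["lead_time"].tolist())
```

Only the value beyond the upper whisker changes; everything inside the whiskers is left as it was.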
# let's look at box plot to see if outliers have been treated or not
plt.figure(figsize=(20, 30))
for i, variable in enumerate(numerical_col):
    plt.subplot(5, 4, i + 1)
    plt.boxplot(ldata[variable], whis=1.5)
    plt.tight_layout()
    plt.title(variable)
plt.show()
replaceStruct = {
    "type_of_meal_plan": {"Meal Plan 1": 1, "Meal Plan 2": 2, "Meal Plan 3": 3, "Not Selected": -1},
    "room_type_reserved": {"Room_Type 1": 1, "Room_Type 2": 2, "Room_Type 3": 3, "Room_Type 4": 4, "Room_Type 5": 5,
                           "Room_Type 6": 6, "Room_Type 7": 7},
    "market_segment_type": {"Online": 1, "Offline": 2, "Corporate": 3, "Complementary": 4, "Aviation": 5},
    "booking_status": {"Not_Canceled": 1, "Canceled": 0}}
oneHotCols = ["type_of_meal_plan", "room_type_reserved", "market_segment_type", "booking_status"]
# integer-encode the categories (this result is replaced by the one-hot encoding on the next line)
ldata = data.replace(replaceStruct)
# one-hot encode the original string categories; note this starts from data, not the outlier-treated ldata
ldata = pd.get_dummies(data, columns=oneHotCols)
ldata.head(10)
| no_of_adults | no_of_children | no_of_weekend_nights | no_of_week_nights | required_car_parking_space | lead_time | repeated_guest | no_of_previous_cancellations | no_of_previous_bookings_not_canceled | avg_price_per_room | ... | room_type_reserved_Room_Type 5 | room_type_reserved_Room_Type 6 | room_type_reserved_Room_Type 7 | market_segment_type_Aviation | market_segment_type_Complementary | market_segment_type_Corporate | market_segment_type_Offline | market_segment_type_Online | booking_status_Canceled | booking_status_Not_Canceled | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2 | 0 | 1 | 2 | 0 | 224 | 0 | 0 | 0 | 65.00 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 |
| 1 | 2 | 0 | 2 | 3 | 0 | 5 | 0 | 0 | 0 | 106.68 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 |
| 2 | 1 | 0 | 2 | 1 | 0 | 1 | 0 | 0 | 0 | 60.00 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 |
| 3 | 2 | 0 | 0 | 2 | 0 | 211 | 0 | 0 | 0 | 100.00 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 |
| 4 | 2 | 0 | 1 | 1 | 0 | 48 | 0 | 0 | 0 | 94.50 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 |
| 5 | 2 | 0 | 0 | 2 | 0 | 346 | 0 | 0 | 0 | 115.00 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 |
| 6 | 2 | 0 | 1 | 3 | 0 | 34 | 0 | 0 | 0 | 107.55 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 |
| 7 | 2 | 0 | 1 | 3 | 0 | 83 | 0 | 0 | 0 | 105.61 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 |
| 8 | 3 | 0 | 0 | 4 | 0 | 121 | 0 | 0 | 0 | 96.90 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 |
| 9 | 2 | 0 | 0 | 5 | 0 | 44 | 0 | 0 | 0 | 133.44 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 |
10 rows × 31 columns
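Because pd.get_dummies is called on data rather than on the result of replace, the integer mapping never reaches the final frame, which is why the dummy columns above still carry the original labels such as room_type_reserved_Room_Type 5. A minimal sketch on a made-up frame shows the difference. The names toy and mapping are assumptions for illustration, and Series.map stands in for DataFrame.replace to keep the toy example short.

```python
import pandas as pd

toy = pd.DataFrame({
    "lead_time": [10, 20, 30],
    "market_segment_type": ["Online", "Offline", "Online"],
})
mapping = {"Online": 1, "Offline": 2}

# one-hot encoding of the original labels: one 0/1 column per category
one_hot = pd.get_dummies(toy, columns=["market_segment_type"], dtype=int)
print(one_hot.columns.tolist())

# integer-encode first, then one-hot: the dummy columns are named after the codes instead
mapped = toy.assign(market_segment_type=toy["market_segment_type"].map(mapping))
one_hot_mapped = pd.get_dummies(mapped, columns=["market_segment_type"], dtype=int)
print(one_hot_mapped.columns.tolist())
```

Either route gives the same 0/1 information; the label-based column names are simply easier to read in a model summary.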
ldata["date"] = ldata["arrival_year_date"].values.astype(float)
ldata.info()
<class 'pandas.core.frame.DataFrame'> Int64Index: 25965 entries, 0 to 36273 Data columns (total 31 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 no_of_adults 25965 non-null int64 1 no_of_children 25965 non-null int64 2 no_of_weekend_nights 25965 non-null int64 3 no_of_week_nights 25965 non-null int64 4 required_car_parking_space 25965 non-null int64 5 lead_time 25965 non-null int64 6 repeated_guest 25965 non-null int64 7 no_of_previous_cancellations 25965 non-null int64 8 no_of_previous_bookings_not_canceled 25965 non-null int64 9 avg_price_per_room 25965 non-null float64 10 no_of_special_requests 25965 non-null int64 11 arrival_year_date 25965 non-null datetime64[ns] 12 date 25965 non-null float64 13 type_of_meal_plan_Meal Plan 1 25965 non-null uint8 14 type_of_meal_plan_Meal Plan 2 25965 non-null uint8 15 type_of_meal_plan_Meal Plan 3 25965 non-null uint8 16 type_of_meal_plan_Not Selected 25965 non-null uint8 17 room_type_reserved_Room_Type 1 25965 non-null uint8 18 room_type_reserved_Room_Type 2 25965 non-null uint8 19 room_type_reserved_Room_Type 3 25965 non-null uint8 20 room_type_reserved_Room_Type 4 25965 non-null uint8 21 room_type_reserved_Room_Type 5 25965 non-null uint8 22 room_type_reserved_Room_Type 6 25965 non-null uint8 23 room_type_reserved_Room_Type 7 25965 non-null uint8 24 market_segment_type_Aviation 25965 non-null uint8 25 market_segment_type_Complementary 25965 non-null uint8 26 market_segment_type_Corporate 25965 non-null uint8 27 market_segment_type_Offline 25965 non-null uint8 28 market_segment_type_Online 25965 non-null uint8 29 booking_status_Canceled 25965 non-null uint8 30 booking_status_Not_Canceled 25965 non-null uint8 dtypes: datetime64[ns](1), float64(2), int64(10), uint8(18) memory usage: 4.2 MB
ldata.drop(['arrival_year_date', 'booking_status_Not_Canceled'], axis=1, inplace=True)
ldata.info()
<class 'pandas.core.frame.DataFrame'> Int64Index: 25965 entries, 0 to 36273 Data columns (total 29 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 no_of_adults 25965 non-null int64 1 no_of_children 25965 non-null int64 2 no_of_weekend_nights 25965 non-null int64 3 no_of_week_nights 25965 non-null int64 4 required_car_parking_space 25965 non-null int64 5 lead_time 25965 non-null int64 6 repeated_guest 25965 non-null int64 7 no_of_previous_cancellations 25965 non-null int64 8 no_of_previous_bookings_not_canceled 25965 non-null int64 9 avg_price_per_room 25965 non-null float64 10 no_of_special_requests 25965 non-null int64 11 date 25965 non-null float64 12 type_of_meal_plan_Meal Plan 1 25965 non-null uint8 13 type_of_meal_plan_Meal Plan 2 25965 non-null uint8 14 type_of_meal_plan_Meal Plan 3 25965 non-null uint8 15 type_of_meal_plan_Not Selected 25965 non-null uint8 16 room_type_reserved_Room_Type 1 25965 non-null uint8 17 room_type_reserved_Room_Type 2 25965 non-null uint8 18 room_type_reserved_Room_Type 3 25965 non-null uint8 19 room_type_reserved_Room_Type 4 25965 non-null uint8 20 room_type_reserved_Room_Type 5 25965 non-null uint8 21 room_type_reserved_Room_Type 6 25965 non-null uint8 22 room_type_reserved_Room_Type 7 25965 non-null uint8 23 market_segment_type_Aviation 25965 non-null uint8 24 market_segment_type_Complementary 25965 non-null uint8 25 market_segment_type_Corporate 25965 non-null uint8 26 market_segment_type_Offline 25965 non-null uint8 27 market_segment_type_Online 25965 non-null uint8 28 booking_status_Canceled 25965 non-null uint8 dtypes: float64(2), int64(10), uint8(17) memory usage: 4.0 MB
print(data.type_of_meal_plan.value_counts())
print(data.room_type_reserved.value_counts())
print(data.market_segment_type.value_counts())
print(data.booking_status.value_counts())
pdata = data.copy()
pdata.head()
Meal Plan 1     20367
Not Selected     4462
Meal Plan 2      1131
Meal Plan 3         5
Name: type_of_meal_plan, dtype: int64
Room_Type 1    18617
Room_Type 4     5415
Room_Type 6      941
Room_Type 2      600
Room_Type 5      230
Room_Type 7      156
Room_Type 3        6
Name: room_type_reserved, dtype: int64
Online           19999
Offline           4104
Corporate         1406
Complementary      356
Aviation           100
Name: market_segment_type, dtype: int64
Not_Canceled    18530
Canceled         7435
Name: booking_status, dtype: int64
| no_of_adults | no_of_children | no_of_weekend_nights | no_of_week_nights | type_of_meal_plan | required_car_parking_space | room_type_reserved | lead_time | market_segment_type | repeated_guest | no_of_previous_cancellations | no_of_previous_bookings_not_canceled | avg_price_per_room | no_of_special_requests | booking_status | arrival_year_date | date | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2 | 0 | 1 | 2 | Meal Plan 1 | 0 | Room_Type 1 | 224 | Offline | 0 | 0 | 0 | 65.00 | 0 | Not_Canceled | 2017-10-02 | 1.506902e+18 |
| 1 | 2 | 0 | 2 | 3 | Not Selected | 0 | Room_Type 1 | 5 | Online | 0 | 0 | 0 | 106.68 | 1 | Not_Canceled | 2018-11-06 | 1.541462e+18 |
| 2 | 1 | 0 | 2 | 1 | Meal Plan 1 | 0 | Room_Type 1 | 1 | Online | 0 | 0 | 0 | 60.00 | 0 | Canceled | 2018-02-28 | 1.519776e+18 |
| 3 | 2 | 0 | 0 | 2 | Meal Plan 1 | 0 | Room_Type 1 | 211 | Online | 0 | 0 | 0 | 100.00 | 0 | Canceled | 2018-05-20 | 1.526774e+18 |
| 4 | 2 | 0 | 1 | 1 | Not Selected | 0 | Room_Type 1 | 48 | Online | 0 | 0 | 0 | 94.50 | 0 | Canceled | 2018-04-11 | 1.523405e+18 |
replaceStruct = {
    "type_of_meal_plan": {"Meal Plan 1": 1, "Meal Plan 2": 2, "Meal Plan 3": 3, "Not Selected": -1},
    "room_type_reserved": {"Room_Type 1": 1, "Room_Type 2": 2, "Room_Type 3": 3, "Room_Type 4": 4, "Room_Type 5": 5,
                           "Room_Type 6": 6, "Room_Type 7": 7},
    "market_segment_type": {"Online": 1, "Offline": 2, "Corporate": 3, "Complementary": 4, "Aviation": 5},
    "booking_status": {"Not_Canceled": 1, "Canceled": 0}}
oneHotCols = ["type_of_meal_plan", "room_type_reserved", "market_segment_type", "booking_status"]
# integer-encode the categories (this result is replaced by the one-hot encoding on the next line)
pdata = data.replace(replaceStruct)
# one-hot encode the original string categories of data
pdata = pd.get_dummies(data, columns=oneHotCols)
pdata.head(10)
| no_of_adults | no_of_children | no_of_weekend_nights | no_of_week_nights | required_car_parking_space | lead_time | repeated_guest | no_of_previous_cancellations | no_of_previous_bookings_not_canceled | avg_price_per_room | no_of_special_requests | arrival_year_date | date | type_of_meal_plan_Meal Plan 1 | type_of_meal_plan_Meal Plan 2 | type_of_meal_plan_Meal Plan 3 | type_of_meal_plan_Not Selected | room_type_reserved_Room_Type 1 | room_type_reserved_Room_Type 2 | room_type_reserved_Room_Type 3 | room_type_reserved_Room_Type 4 | room_type_reserved_Room_Type 5 | room_type_reserved_Room_Type 6 | room_type_reserved_Room_Type 7 | market_segment_type_Aviation | market_segment_type_Complementary | market_segment_type_Corporate | market_segment_type_Offline | market_segment_type_Online | booking_status_Canceled | booking_status_Not_Canceled | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2 | 0 | 1 | 2 | 0 | 224 | 0 | 0 | 0 | 65.00 | 0 | 2017-10-02 | 1.506902e+18 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 |
| 1 | 2 | 0 | 2 | 3 | 0 | 5 | 0 | 0 | 0 | 106.68 | 1 | 2018-11-06 | 1.541462e+18 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 |
| 2 | 1 | 0 | 2 | 1 | 0 | 1 | 0 | 0 | 0 | 60.00 | 0 | 2018-02-28 | 1.519776e+18 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 |
| 3 | 2 | 0 | 0 | 2 | 0 | 211 | 0 | 0 | 0 | 100.00 | 0 | 2018-05-20 | 1.526774e+18 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 |
| 4 | 2 | 0 | 1 | 1 | 0 | 48 | 0 | 0 | 0 | 94.50 | 0 | 2018-04-11 | 1.523405e+18 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 |
| 5 | 2 | 0 | 0 | 2 | 0 | 346 | 0 | 0 | 0 | 115.00 | 1 | 2018-09-13 | 1.536797e+18 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 |
| 6 | 2 | 0 | 1 | 3 | 0 | 34 | 0 | 0 | 0 | 107.55 | 1 | 2017-10-15 | 1.508026e+18 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 |
| 7 | 2 | 0 | 1 | 3 | 0 | 83 | 0 | 0 | 0 | 105.61 | 1 | 2018-12-26 | 1.545782e+18 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 |
| 8 | 3 | 0 | 0 | 4 | 0 | 121 | 0 | 0 | 0 | 96.90 | 1 | 2018-07-06 | 1.530835e+18 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 |
| 9 | 2 | 0 | 0 | 5 | 0 | 44 | 0 | 0 | 0 | 133.44 | 3 | 2018-10-18 | 1.539821e+18 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 |
This one-hot encodes every categorical variable, including booking status. Since booking status is the dependent variable, one of its two dummy columns is redundant and is dropped in a later step.
pdata["date"] = pdata["arrival_year_date"].values.astype(float)
pdata.info()
<class 'pandas.core.frame.DataFrame'> Int64Index: 25965 entries, 0 to 36273 Data columns (total 31 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 no_of_adults 25965 non-null int64 1 no_of_children 25965 non-null int64 2 no_of_weekend_nights 25965 non-null int64 3 no_of_week_nights 25965 non-null int64 4 required_car_parking_space 25965 non-null int64 5 lead_time 25965 non-null int64 6 repeated_guest 25965 non-null int64 7 no_of_previous_cancellations 25965 non-null int64 8 no_of_previous_bookings_not_canceled 25965 non-null int64 9 avg_price_per_room 25965 non-null float64 10 no_of_special_requests 25965 non-null int64 11 arrival_year_date 25965 non-null datetime64[ns] 12 date 25965 non-null float64 13 type_of_meal_plan_Meal Plan 1 25965 non-null uint8 14 type_of_meal_plan_Meal Plan 2 25965 non-null uint8 15 type_of_meal_plan_Meal Plan 3 25965 non-null uint8 16 type_of_meal_plan_Not Selected 25965 non-null uint8 17 room_type_reserved_Room_Type 1 25965 non-null uint8 18 room_type_reserved_Room_Type 2 25965 non-null uint8 19 room_type_reserved_Room_Type 3 25965 non-null uint8 20 room_type_reserved_Room_Type 4 25965 non-null uint8 21 room_type_reserved_Room_Type 5 25965 non-null uint8 22 room_type_reserved_Room_Type 6 25965 non-null uint8 23 room_type_reserved_Room_Type 7 25965 non-null uint8 24 market_segment_type_Aviation 25965 non-null uint8 25 market_segment_type_Complementary 25965 non-null uint8 26 market_segment_type_Corporate 25965 non-null uint8 27 market_segment_type_Offline 25965 non-null uint8 28 market_segment_type_Online 25965 non-null uint8 29 booking_status_Canceled 25965 non-null uint8 30 booking_status_Not_Canceled 25965 non-null uint8 dtypes: datetime64[ns](1), float64(2), int64(10), uint8(18) memory usage: 4.2 MB
pdata.head()
| no_of_adults | no_of_children | no_of_weekend_nights | no_of_week_nights | required_car_parking_space | lead_time | repeated_guest | no_of_previous_cancellations | no_of_previous_bookings_not_canceled | avg_price_per_room | no_of_special_requests | arrival_year_date | date | type_of_meal_plan_Meal Plan 1 | type_of_meal_plan_Meal Plan 2 | type_of_meal_plan_Meal Plan 3 | type_of_meal_plan_Not Selected | room_type_reserved_Room_Type 1 | room_type_reserved_Room_Type 2 | room_type_reserved_Room_Type 3 | room_type_reserved_Room_Type 4 | room_type_reserved_Room_Type 5 | room_type_reserved_Room_Type 6 | room_type_reserved_Room_Type 7 | market_segment_type_Aviation | market_segment_type_Complementary | market_segment_type_Corporate | market_segment_type_Offline | market_segment_type_Online | booking_status_Canceled | booking_status_Not_Canceled | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2 | 0 | 1 | 2 | 0 | 224 | 0 | 0 | 0 | 65.00 | 0 | 2017-10-02 | 1.506902e+18 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 |
| 1 | 2 | 0 | 2 | 3 | 0 | 5 | 0 | 0 | 0 | 106.68 | 1 | 2018-11-06 | 1.541462e+18 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 |
| 2 | 1 | 0 | 2 | 1 | 0 | 1 | 0 | 0 | 0 | 60.00 | 0 | 2018-02-28 | 1.519776e+18 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 |
| 3 | 2 | 0 | 0 | 2 | 0 | 211 | 0 | 0 | 0 | 100.00 | 0 | 2018-05-20 | 1.526774e+18 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 |
| 4 | 2 | 0 | 1 | 1 | 0 | 48 | 0 | 0 | 0 | 94.50 | 0 | 2018-04-11 | 1.523405e+18 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 |
pdata.drop(['arrival_year_date', 'booking_status_Canceled'], axis=1, inplace=True)
pdata.info()
<class 'pandas.core.frame.DataFrame'> Int64Index: 25965 entries, 0 to 36273 Data columns (total 29 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 no_of_adults 25965 non-null int64 1 no_of_children 25965 non-null int64 2 no_of_weekend_nights 25965 non-null int64 3 no_of_week_nights 25965 non-null int64 4 required_car_parking_space 25965 non-null int64 5 lead_time 25965 non-null int64 6 repeated_guest 25965 non-null int64 7 no_of_previous_cancellations 25965 non-null int64 8 no_of_previous_bookings_not_canceled 25965 non-null int64 9 avg_price_per_room 25965 non-null float64 10 no_of_special_requests 25965 non-null int64 11 date 25965 non-null float64 12 type_of_meal_plan_Meal Plan 1 25965 non-null uint8 13 type_of_meal_plan_Meal Plan 2 25965 non-null uint8 14 type_of_meal_plan_Meal Plan 3 25965 non-null uint8 15 type_of_meal_plan_Not Selected 25965 non-null uint8 16 room_type_reserved_Room_Type 1 25965 non-null uint8 17 room_type_reserved_Room_Type 2 25965 non-null uint8 18 room_type_reserved_Room_Type 3 25965 non-null uint8 19 room_type_reserved_Room_Type 4 25965 non-null uint8 20 room_type_reserved_Room_Type 5 25965 non-null uint8 21 room_type_reserved_Room_Type 6 25965 non-null uint8 22 room_type_reserved_Room_Type 7 25965 non-null uint8 23 market_segment_type_Aviation 25965 non-null uint8 24 market_segment_type_Complementary 25965 non-null uint8 25 market_segment_type_Corporate 25965 non-null uint8 26 market_segment_type_Offline 25965 non-null uint8 27 market_segment_type_Online 25965 non-null uint8 28 booking_status_Canceled 25965 non-null uint8 dtypes: float64(2), int64(10), uint8(17) memory usage: 4.0 MB
ldata.head()
| no_of_adults | no_of_children | no_of_weekend_nights | no_of_week_nights | required_car_parking_space | lead_time | repeated_guest | no_of_previous_cancellations | no_of_previous_bookings_not_canceled | avg_price_per_room | ... | room_type_reserved_Room_Type 4 | room_type_reserved_Room_Type 5 | room_type_reserved_Room_Type 6 | room_type_reserved_Room_Type 7 | market_segment_type_Aviation | market_segment_type_Complementary | market_segment_type_Corporate | market_segment_type_Offline | market_segment_type_Online | booking_status_Canceled | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2 | 0 | 1 | 2 | 0 | 224 | 0 | 0 | 0 | 65.00 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 1 | 2 | 0 | 2 | 3 | 0 | 5 | 0 | 0 | 0 | 106.68 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 2 | 1 | 0 | 2 | 1 | 0 | 1 | 0 | 0 | 0 | 60.00 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 |
| 3 | 2 | 0 | 0 | 2 | 0 | 211 | 0 | 0 | 0 | 100.00 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 |
| 4 | 2 | 0 | 1 | 1 | 0 | 48 | 0 | 0 | 0 | 94.50 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 |
5 rows × 29 columns
tdata = data.copy()
meal_plan = pd.get_dummies(tdata['type_of_meal_plan'], drop_first=True)
room_type = pd.get_dummies(tdata['room_type_reserved'], drop_first=True)
market_segment = pd.get_dummies(tdata['market_segment_type'], drop_first=True)
booking_status = pd.get_dummies(tdata['booking_status'], drop_first=True)
tdata.drop(['type_of_meal_plan', 'room_type_reserved', 'market_segment_type', 'booking_status', 'arrival_year_date'], axis=1, inplace=True)
tdata = pd.concat((tdata, meal_plan, room_type, market_segment, booking_status), axis=1)
tdata.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 25965 entries, 0 to 36273
Data columns (total 26 columns):
 #   Column                                Non-Null Count  Dtype
---  ------                                --------------  -----
 0   no_of_adults                          25965 non-null  int64
 1   no_of_children                        25965 non-null  int64
 2   no_of_weekend_nights                  25965 non-null  int64
 3   no_of_week_nights                     25965 non-null  int64
 4   required_car_parking_space            25965 non-null  int64
 5   lead_time                             25965 non-null  int64
 6   repeated_guest                        25965 non-null  int64
 7   no_of_previous_cancellations          25965 non-null  int64
 8   no_of_previous_bookings_not_canceled  25965 non-null  int64
 9   avg_price_per_room                    25965 non-null  float64
 10  no_of_special_requests                25965 non-null  int64
 11  date                                  25965 non-null  float64
 12  Meal Plan 2                           25965 non-null  uint8
 13  Meal Plan 3                           25965 non-null  uint8
 14  Not Selected                          25965 non-null  uint8
 15  Room_Type 2                           25965 non-null  uint8
 16  Room_Type 3                           25965 non-null  uint8
 17  Room_Type 4                           25965 non-null  uint8
 18  Room_Type 5                           25965 non-null  uint8
 19  Room_Type 6                           25965 non-null  uint8
 20  Room_Type 7                           25965 non-null  uint8
 21  Complementary                         25965 non-null  uint8
 22  Corporate                             25965 non-null  uint8
 23  Offline                               25965 non-null  uint8
 24  Online                                25965 non-null  uint8
 25  Not_Canceled                          25965 non-null  uint8
dtypes: float64(2), int64(10), uint8(14)
memory usage: 3.9 MB
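With drop_first=True, pd.get_dummies drops the first level of each variable, and that level becomes the baseline against which the remaining dummies are read. In the frame above the missing levels are Meal Plan 1, Room_Type 1, Aviation and Canceled. A small sketch with made-up values (the variable names are illustrative):

```python
import pandas as pd

status = pd.Series(["Not_Canceled", "Canceled", "Not_Canceled"], name="booking_status")

full = pd.get_dummies(status, dtype=int)
reduced = pd.get_dummies(status, drop_first=True, dtype=int)

print(full.columns.tolist())     # both levels
print(reduced.columns.tolist())  # the first level, "Canceled", is dropped

# the dropped level is still recoverable: it is the row where every remaining dummy is 0
is_canceled = reduced.sum(axis=1) == 0
print(is_canceled.tolist())
```

Dropping one level per variable avoids perfectly collinear dummy columns once a constant is added to the model.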
X = tdata.drop(["Not_Canceled"], axis=1)
Y = tdata["Not_Canceled"]
# adding a constant to X variable
X = add_constant(X)
# creating dummies
X = pd.get_dummies(X, drop_first=True)
# Splitting data in train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, Y, test_size=0.30, random_state=1, stratify=Y
)
The stratify argument preserves the original class distribution of the target variable in both the train and test sets.
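What stratification does can be checked by hand: draw the same fraction from each class separately, and the class mix carries over to both parts. The sketch below is a hand-rolled illustration using pandas group sampling on made-up data, not scikit-learn's implementation; toy and its 70/30 split are assumptions.

```python
import pandas as pd

# made-up target with a 70/30 class split (1 = Not_Canceled, 0 = Canceled)
toy = pd.DataFrame({"x": range(100), "y": [1] * 70 + [0] * 30})

# draw 30% of each class separately, so the test set keeps the 70/30 mix
test = toy.groupby("y").sample(frac=0.30, random_state=1)
train = toy.drop(test.index)

print(toy["y"].mean(), train["y"].mean(), test["y"].mean())
```

A plain random split would only match these proportions on average, which matters more as the minority class gets smaller.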
# fitting the model on training set
logit = sm.Logit(y_train, X_train.astype(float))
lg = logit.fit()
Warning: Maximum number of iterations has been exceeded.
Current function value: inf
Iterations: 35
C:\Users\eliza\anaconda3\lib\site-packages\statsmodels\base\model.py:566: ConvergenceWarning: Maximum Likelihood optimization failed to converge. Check mle_retvals
warnings.warn("Maximum Likelihood optimization failed to "
# let's print the logistic regression summary
print(lg.summary())
Logit Regression Results
==============================================================================
Dep. Variable: Not_Canceled No. Observations: 18175
Model: Logit Df Residuals: 18149
Method: MLE Df Model: 25
Date: Fri, 13 May 2022 Pseudo R-squ.: inf
Time: 11:15:19 Log-Likelihood: -inf
converged: False LL-Null: 0.0000
Covariance Type: nonrobust LLR p-value: 1.000
========================================================================================================
coef std err z P>|z| [0.025 0.975]
--------------------------------------------------------------------------------------------------------
const 3.7360 3.215 1.162 0.245 -2.566 10.038
no_of_adults 0.0012 0.046 0.025 0.980 -0.089 0.092
no_of_children -0.1562 0.061 -2.581 0.010 -0.275 -0.038
no_of_weekend_nights -0.0472 0.023 -2.013 0.044 -0.093 -0.001
no_of_week_nights -0.0981 0.014 -7.055 0.000 -0.125 -0.071
required_car_parking_space 1.4909 0.139 10.695 0.000 1.218 1.764
lead_time -0.0158 0.000 -42.927 0.000 -0.016 -0.015
repeated_guest 2.7200 0.727 3.744 0.000 1.296 4.144
no_of_previous_cancellations -0.0403 0.243 -0.166 0.868 -0.518 0.437
no_of_previous_bookings_not_canceled 0.0473 0.127 0.373 0.709 -0.201 0.296
avg_price_per_room -0.0163 0.001 -18.764 0.000 -0.018 -0.015
no_of_special_requests 1.4799 0.034 44.147 0.000 1.414 1.546
date -8.251e-19 2.1e-18 -0.392 0.695 -4.95e-18 3.3e-18
Meal Plan 2 0.0608 0.107 0.567 0.570 -0.149 0.271
Meal Plan 3 -11.7872 193.550 -0.061 0.951 -391.139 367.564
Not Selected -0.2818 0.058 -4.857 0.000 -0.396 -0.168
Room_Type 2 0.2590 0.146 1.775 0.076 -0.027 0.545
Room_Type 3 -0.2673 2.830 -0.094 0.925 -5.814 5.279
Room_Type 4 0.0896 0.058 1.545 0.122 -0.024 0.203
Room_Type 5 0.7380 0.250 2.950 0.003 0.248 1.228
Room_Type 6 0.7452 0.153 4.881 0.000 0.446 1.044
Room_Type 7 1.2699 0.326 3.897 0.000 0.631 1.909
Complementary 34.9279 5.58e+05 6.26e-05 1.000 -1.09e+06 1.09e+06
Corporate 0.9929 0.325 3.055 0.002 0.356 1.630
Offline 2.7508 0.315 8.731 0.000 2.133 3.368
Online 0.3609 0.306 1.178 0.239 -0.239 0.961
========================================================================================================
C:\Users\eliza\anaconda3\lib\site-packages\statsmodels\base\model.py:547: HessianInversionWarning: Inverting hessian failed, no bse or cov_params available
warnings.warn('Inverting hessian failed, no bse or cov_params '
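P>|z|: As in linear regression, the p-value tests the null hypothesis that a coefficient is zero. A p-value below the chosen significance level (commonly 0.05) indicates that the predictor is statistically significant.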
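The z and P>|z| columns can be reproduced from the coef and std err columns. The sketch below does this for the no_of_children row of the summary above, using only the standard library. Because the printed standard error is rounded, the results match the table only approximately, and since this fit did not converge the figures should be read with caution.

```python
import math

# values read from the summary table for no_of_children
coef = -0.1562
std_err = 0.061  # rounded in the table, so the results below are approximate

# Wald z-statistic: the coefficient measured in units of its standard error
z = coef / std_err

# two-sided p-value under a standard normal: P(|Z| >= |z|) = erfc(|z| / sqrt(2))
p_value = math.erfc(abs(z) / math.sqrt(2))

print(round(z, 2), round(p_value, 3))
```

The same two lines of arithmetic apply to every row of the table.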
# predicting on training set
# default threshold is 0.5, if predicted probability is greater than 0.5 the observation will be classified as 1
pred_train = lg.predict(X_train) > 0.5
pred_train = np.round(pred_train)
cm = confusion_matrix(y_train, pred_train)
plt.figure(figsize=(7, 5))
sns.heatmap(cm, annot=True, fmt="g")
plt.xlabel("Predicted Values")
plt.ylabel("Actual Values")
plt.show()
print("Accuracy on training set : ", accuracy_score(y_train, pred_train))
Accuracy on training set : 0.8078679504814306
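The path from predicted probabilities to an accuracy figure can be traced on a handful of made-up values. The sketch below applies the same 0.5 threshold and counts the four confusion-matrix cells by hand; the labels and probabilities are invented for illustration.

```python
import numpy as np

# made-up labels and predicted probabilities, for illustration only
y_true = np.array([1, 1, 1, 0, 0, 1, 0, 1])
prob = np.array([0.9, 0.6, 0.4, 0.2, 0.7, 0.8, 0.1, 0.55])

threshold = 0.5
y_pred = (prob > threshold).astype(int)

# confusion-matrix cells, with 1 (Not_Canceled) as the positive class
tp = int(np.sum((y_true == 1) & (y_pred == 1)))
tn = int(np.sum((y_true == 0) & (y_pred == 0)))
fp = int(np.sum((y_true == 0) & (y_pred == 1)))
fn = int(np.sum((y_true == 1) & (y_pred == 0)))

# accuracy is the share of observations on the diagonal of the confusion matrix
accuracy = (tp + tn) / (tp + tn + fp + fn)
print(tp, tn, fp, fn, accuracy)
```

Raising or lowering the threshold shifts observations between the cells, so accuracy alone does not show which kind of error the model makes.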
# let's check the VIF of the predictors
vif_series = pd.Series(
    [variance_inflation_factor(X_train.values, i) for i in range(X_train.shape[1])],
    index=X_train.columns,
    dtype=float,
)
print("VIF values: \n\n{}\n".format(vif_series))
VIF values:

const                                   0.000000
no_of_adults                            1.004615
no_of_children                          1.000130
no_of_weekend_nights                    1.000457
no_of_week_nights                       1.000624
required_car_parking_space              0.999985
lead_time                               1.003908
repeated_guest                          0.999965
no_of_previous_cancellations            0.999975
no_of_previous_bookings_not_canceled    1.000035
avg_price_per_room                      1.008965
no_of_special_requests                  1.001272
date                                    1.228558
Meal Plan 2                             0.999966
Meal Plan 3                             0.999997
Not Selected                            1.000715
Room_Type 2                             0.999929
Room_Type 3                             0.999999
Room_Type 4                             1.000670
Room_Type 5                             1.000030
Room_Type 6                             1.000032
Room_Type 7                             1.000024
Complementary                           0.999907
Corporate                               0.999816
Offline                                 0.999413
Online                                  1.002986
dtype: float64
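For reference, the VIF of a predictor is 1 / (1 - R^2), where R^2 comes from regressing that predictor on all the others; values above roughly 5 to 10 are commonly treated as a sign of multicollinearity. The sketch below computes it with NumPy on synthetic data. The function vif is a hypothetical helper that adds its own intercept, so it is an illustration of the definition rather than a copy of the statsmodels routine used above.

```python
import numpy as np


def vif(X):
    """VIF of each column of X: 1 / (1 - R^2) from regressing it on the other columns."""
    X = np.asarray(X, dtype=float)
    n, k = X.shape
    out = np.empty(k)
    for j in range(k):
        y = X[:, j]
        # the other predictors plus an intercept column
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        beta, *_ = np.linalg.lstsq(others, y, rcond=None)
        resid = y - others @ beta
        r2 = 1.0 - resid.var() / y.var()
        out[j] = 1.0 / (1.0 - r2)
    return out


rng = np.random.default_rng(0)
x1 = rng.normal(size=500)
x2 = x1 + 0.1 * rng.normal(size=500)  # nearly a copy of x1
x3 = rng.normal(size=500)             # unrelated to the others
values = vif(np.column_stack([x1, x2, x3]))
print(values.round(2))
```

The two near-duplicate columns get large VIFs while the unrelated one stays close to 1.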
# summary of initial logistic regression model
print(lg.summary())
Logit Regression Results
==============================================================================
Dep. Variable: Not_Canceled No. Observations: 18175
Model: Logit Df Residuals: 18149
Method: MLE Df Model: 25
Date: Fri, 13 May 2022 Pseudo R-squ.: inf
Time: 11:15:39 Log-Likelihood: -inf
converged: False LL-Null: 0.0000
Covariance Type: nonrobust LLR p-value: 1.000
========================================================================================================
coef std err z P>|z| [0.025 0.975]
--------------------------------------------------------------------------------------------------------
const 3.7360 3.215 1.162 0.245 -2.566 10.038
no_of_adults 0.0012 0.046 0.025 0.980 -0.089 0.092
no_of_children -0.1562 0.061 -2.581 0.010 -0.275 -0.038
no_of_weekend_nights -0.0472 0.023 -2.013 0.044 -0.093 -0.001
no_of_week_nights -0.0981 0.014 -7.055 0.000 -0.125 -0.071
required_car_parking_space 1.4909 0.139 10.695 0.000 1.218 1.764
lead_time -0.0158 0.000 -42.927 0.000 -0.016 -0.015
repeated_guest 2.7200 0.727 3.744 0.000 1.296 4.144
no_of_previous_cancellations -0.0403 0.243 -0.166 0.868 -0.518 0.437
no_of_previous_bookings_not_canceled 0.0473 0.127 0.373 0.709 -0.201 0.296
avg_price_per_room -0.0163 0.001 -18.764 0.000 -0.018 -0.015
no_of_special_requests 1.4799 0.034 44.147 0.000 1.414 1.546
date -8.251e-19 2.1e-18 -0.392 0.695 -4.95e-18 3.3e-18
Meal Plan 2 0.0608 0.107 0.567 0.570 -0.149 0.271
Meal Plan 3 -11.7872 193.550 -0.061 0.951 -391.139 367.564
Not Selected -0.2818 0.058 -4.857 0.000 -0.396 -0.168
Room_Type 2 0.2590 0.146 1.775 0.076 -0.027 0.545
Room_Type 3 -0.2673 2.830 -0.094 0.925 -5.814 5.279
Room_Type 4 0.0896 0.058 1.545 0.122 -0.024 0.203
Room_Type 5 0.7380 0.250 2.950 0.003 0.248 1.228
Room_Type 6 0.7452 0.153 4.881 0.000 0.446 1.044
Room_Type 7 1.2699 0.326 3.897 0.000 0.631 1.909
Complementary 34.9279 5.58e+05 6.26e-05 1.000 -1.09e+06 1.09e+06
Corporate 0.9929 0.325 3.055 0.002 0.356 1.630
Offline 2.7508 0.315 8.731 0.000 2.133 3.368
Online 0.3609 0.306 1.178 0.239 -0.239 0.961
========================================================================================================
X_train1 = X_train.drop("Complementary", axis=1)
# fitting the model on training set
logit1 = sm.Logit(y_train, X_train1.astype(float))
lg1 = logit1.fit()
pred_train1 = lg1.predict(X_train1)
pred_train1 = np.round(pred_train1)
print("Accuracy on training set : ", accuracy_score(y_train, pred_train1))
print(lg1.summary())
Optimization terminated successfully.
Current function value: inf
Iterations 11
Accuracy on training set : 0.8079779917469051
Logit Regression Results
==============================================================================
Dep. Variable: Not_Canceled No. Observations: 18175
Model: Logit Df Residuals: 18150
Method: MLE Df Model: 24
Date: Fri, 13 May 2022 Pseudo R-squ.: inf
Time: 11:15:59 Log-Likelihood: -inf
converged: True LL-Null: 0.0000
Covariance Type: nonrobust LLR p-value: 1.000
========================================================================================================
coef std err z P>|z| [0.025 0.975]
--------------------------------------------------------------------------------------------------------
const 4.3929 3.207 1.370 0.171 -1.892 10.678
no_of_adults 0.0141 0.046 0.306 0.760 -0.076 0.104
no_of_children -0.1497 0.061 -2.473 0.013 -0.268 -0.031
no_of_weekend_nights -0.0500 0.023 -2.132 0.033 -0.096 -0.004
no_of_week_nights -0.0995 0.014 -7.159 0.000 -0.127 -0.072
required_car_parking_space 1.4918 0.139 10.696 0.000 1.218 1.765
lead_time -0.0158 0.000 -43.024 0.000 -0.016 -0.015
repeated_guest 2.6692 0.730 3.656 0.000 1.238 4.100
no_of_previous_cancellations -0.0396 0.242 -0.164 0.870 -0.514 0.435
no_of_previous_bookings_not_canceled 0.0508 0.129 0.393 0.694 -0.202 0.304
avg_price_per_room -0.0167 0.001 -19.434 0.000 -0.018 -0.015
no_of_special_requests 1.4831 0.034 44.242 0.000 1.417 1.549
date -8.124e-19 2.1e-18 -0.387 0.699 -4.93e-18 3.3e-18
Meal Plan 2 0.0730 0.107 0.682 0.496 -0.137 0.283
Meal Plan 3 -2.6054 1.750 -1.489 0.137 -6.035 0.824
Not Selected -0.2883 0.058 -4.969 0.000 -0.402 -0.175
Room_Type 2 0.2583 0.146 1.771 0.077 -0.028 0.544
Room_Type 3 -0.2609 2.835 -0.092 0.927 -5.816 5.295
Room_Type 4 0.0878 0.058 1.513 0.130 -0.026 0.201
Room_Type 5 0.7576 0.250 3.035 0.002 0.268 1.247
Room_Type 6 0.7610 0.153 4.985 0.000 0.462 1.060
Room_Type 7 1.3123 0.325 4.034 0.000 0.675 1.950
Corporate 0.3411 0.294 1.161 0.246 -0.235 0.917
Offline 2.0919 0.281 7.444 0.000 1.541 2.643
Online -0.2907 0.273 -1.066 0.286 -0.825 0.244
========================================================================================================
C:\Users\eliza\anaconda3\lib\site-packages\statsmodels\base\model.py:547: HessianInversionWarning: Inverting hessian failed, no bse or cov_params available
warnings.warn('Inverting hessian failed, no bse or cov_params '
X_train2 = X_train1.drop("Room_Type 3", axis=1)
# fitting the model on training set
logit2 = sm.Logit(y_train, X_train2.astype(float))
lg2 = logit2.fit()
pred_train2 = lg2.predict(X_train2)
pred_train2 = np.round(pred_train2)
# score pred_train2 so the reported accuracy belongs to lg2 rather than lg1
print("Accuracy on training set : ", accuracy_score(y_train, pred_train2))
print(lg2.summary())
Optimization terminated successfully.
Current function value: inf
Iterations 11
Accuracy on training set : 0.8079779917469051
Logit Regression Results
==============================================================================
Dep. Variable: Not_Canceled No. Observations: 18175
Model: Logit Df Residuals: 18151
Method: MLE Df Model: 23
Date: Fri, 13 May 2022 Pseudo R-squ.: inf
Time: 11:16:39 Log-Likelihood: -inf
converged: True LL-Null: 0.0000
Covariance Type: nonrobust LLR p-value: 1.000
========================================================================================================
coef std err z P>|z| [0.025 0.975]
--------------------------------------------------------------------------------------------------------
const 4.3919 3.207 1.370 0.171 -1.893 10.677
no_of_adults 0.0141 0.046 0.307 0.759 -0.076 0.104
no_of_children -0.1496 0.061 -2.473 0.013 -0.268 -0.031
no_of_weekend_nights -0.0500 0.023 -2.132 0.033 -0.096 -0.004
no_of_week_nights -0.0995 0.014 -7.159 0.000 -0.127 -0.072
required_car_parking_space 1.4918 0.139 10.696 0.000 1.218 1.765
lead_time -0.0158 0.000 -43.026 0.000 -0.016 -0.015
repeated_guest 2.6692 0.730 3.656 0.000 1.238 4.100
no_of_previous_cancellations -0.0396 0.242 -0.164 0.870 -0.514 0.435
no_of_previous_bookings_not_canceled 0.0508 0.129 0.393 0.694 -0.202 0.304
avg_price_per_room -0.0167 0.001 -19.435 0.000 -0.018 -0.015
no_of_special_requests 1.4831 0.034 44.243 0.000 1.417 1.549
date -8.118e-19 2.1e-18 -0.386 0.699 -4.93e-18 3.31e-18
Meal Plan 2 0.0731 0.107 0.682 0.495 -0.137 0.283
Meal Plan 3 -2.6054 1.750 -1.489 0.137 -6.035 0.825
Not Selected -0.2883 0.058 -4.968 0.000 -0.402 -0.175
Room_Type 2 0.2583 0.146 1.771 0.077 -0.028 0.544
Room_Type 4 0.0878 0.058 1.514 0.130 -0.026 0.201
Room_Type 5 0.7576 0.250 3.035 0.002 0.268 1.247
Room_Type 6 0.7611 0.153 4.986 0.000 0.462 1.060
Room_Type 7 1.3123 0.325 4.034 0.000 0.675 1.950
Corporate 0.3411 0.294 1.161 0.246 -0.235 0.917
Offline 2.0918 0.281 7.444 0.000 1.541 2.643
Online -0.2907 0.273 -1.066 0.286 -0.825 0.244
========================================================================================================
X_train3 = X_train2.drop("no_of_previous_cancellations", axis=1)
# fitting the model on training set
logit3 = sm.Logit(y_train, X_train3.astype(float))
lg3 = logit3.fit()
pred_train3 = lg3.predict(X_train3)
pred_train3 = np.round(pred_train3)
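# score this model's own predictions (the original cell passed pred_train1, so the value recorded below is the first model's accuracy)
print("Accuracy on training set : ", accuracy_score(y_train, pred_train3))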
print(lg3.summary())
Optimization terminated successfully.
Current function value: inf
Iterations 11
Accuracy on training set : 0.8079779917469051
Logit Regression Results
==============================================================================
Dep. Variable: Not_Canceled No. Observations: 18175
Model: Logit Df Residuals: 18152
Method: MLE Df Model: 22
Date: Fri, 13 May 2022 Pseudo R-squ.: inf
Time: 11:22:35 Log-Likelihood: -inf
converged: True LL-Null: 0.0000
Covariance Type: nonrobust LLR p-value: 1.000
========================================================================================================
coef std err z P>|z| [0.025 0.975]
--------------------------------------------------------------------------------------------------------
const 4.3869 3.207 1.368 0.171 -1.898 10.672
no_of_adults 0.0141 0.046 0.307 0.759 -0.076 0.104
no_of_children -0.1496 0.061 -2.473 0.013 -0.268 -0.031
no_of_weekend_nights -0.0500 0.023 -2.131 0.033 -0.096 -0.004
no_of_week_nights -0.0995 0.014 -7.159 0.000 -0.127 -0.072
required_car_parking_space 1.4919 0.139 10.696 0.000 1.218 1.765
lead_time -0.0158 0.000 -43.026 0.000 -0.016 -0.015
repeated_guest 2.6514 0.717 3.699 0.000 1.247 4.056
no_of_previous_bookings_not_canceled 0.0453 0.123 0.369 0.712 -0.196 0.286
avg_price_per_room -0.0167 0.001 -19.435 0.000 -0.018 -0.015
no_of_special_requests 1.4831 0.034 44.243 0.000 1.417 1.549
date -8.084e-19 2.1e-18 -0.385 0.700 -4.93e-18 3.31e-18
Meal Plan 2 0.0731 0.107 0.682 0.495 -0.137 0.283
Meal Plan 3 -2.6052 1.750 -1.489 0.137 -6.035 0.825
Not Selected -0.2883 0.058 -4.969 0.000 -0.402 -0.175
Room_Type 2 0.2584 0.146 1.771 0.076 -0.028 0.544
Room_Type 4 0.0878 0.058 1.514 0.130 -0.026 0.201
Room_Type 5 0.7576 0.250 3.035 0.002 0.268 1.247
Room_Type 6 0.7611 0.153 4.986 0.000 0.462 1.060
Room_Type 7 1.3123 0.325 4.034 0.000 0.675 1.950
Corporate 0.3415 0.294 1.163 0.245 -0.234 0.917
Offline 2.0915 0.281 7.443 0.000 1.541 2.642
Online -0.2910 0.273 -1.067 0.286 -0.825 0.243
========================================================================================================
X_train4 = X_train3.drop("no_of_adults", axis=1)
# fitting the model on training set
logit4 = sm.Logit(y_train, X_train4.astype(float))
lg4 = logit4.fit()
pred_train4 = lg4.predict(X_train4)
pred_train4 = np.round(pred_train4)
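# score this model's own predictions (the original cell passed pred_train1, so the value recorded below is the first model's accuracy)
print("Accuracy on training set : ", accuracy_score(y_train, pred_train4))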
print(lg4.summary())
Optimization terminated successfully.
Current function value: inf
Iterations 11
Accuracy on training set : 0.8079779917469051
Logit Regression Results
==============================================================================
Dep. Variable: Not_Canceled No. Observations: 18175
Model: Logit Df Residuals: 18153
Method: MLE Df Model: 21
Date: Fri, 13 May 2022 Pseudo R-squ.: inf
Time: 11:23:33 Log-Likelihood: -inf
converged: True LL-Null: 0.0000
Covariance Type: nonrobust LLR p-value: 1.000
========================================================================================================
coef std err z P>|z| [0.025 0.975]
--------------------------------------------------------------------------------------------------------
const 4.4396 3.202 1.387 0.166 -1.836 10.715
no_of_children -0.1522 0.060 -2.541 0.011 -0.270 -0.035
no_of_weekend_nights -0.0499 0.023 -2.127 0.033 -0.096 -0.004
no_of_week_nights -0.0994 0.014 -7.157 0.000 -0.127 -0.072
required_car_parking_space 1.4928 0.139 10.704 0.000 1.219 1.766
lead_time -0.0158 0.000 -43.389 0.000 -0.016 -0.015
repeated_guest 2.6497 0.717 3.697 0.000 1.245 4.054
no_of_previous_bookings_not_canceled 0.0454 0.123 0.369 0.712 -0.196 0.286
avg_price_per_room -0.0166 0.001 -19.813 0.000 -0.018 -0.015
no_of_special_requests 1.4838 0.033 44.362 0.000 1.418 1.549
date -8.353e-19 2.1e-18 -0.398 0.691 -4.95e-18 3.28e-18
Meal Plan 2 0.0717 0.107 0.670 0.503 -0.138 0.282
Meal Plan 3 -2.6014 1.746 -1.490 0.136 -6.023 0.821
Not Selected -0.2871 0.058 -4.959 0.000 -0.401 -0.174
Room_Type 2 0.2558 0.146 1.757 0.079 -0.029 0.541
Room_Type 4 0.0918 0.057 1.623 0.105 -0.019 0.203
Room_Type 5 0.7568 0.249 3.034 0.002 0.268 1.246
Room_Type 6 0.7631 0.152 5.006 0.000 0.464 1.062
Room_Type 7 1.3182 0.325 4.060 0.000 0.682 1.955
Corporate 0.3415 0.294 1.163 0.245 -0.234 0.917
Offline 2.0989 0.280 7.501 0.000 1.550 2.647
Online -0.2838 0.271 -1.045 0.296 -0.816 0.248
========================================================================================================
X_train5 = X_train4.drop("no_of_previous_bookings_not_canceled", axis=1)
# fitting the model on training set
logit5 = sm.Logit(y_train, X_train5.astype(float))
lg5 = logit5.fit()
pred_train5 = lg5.predict(X_train5)
pred_train5 = np.round(pred_train5)
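# score this model's own predictions (the original cell passed pred_train1, so the value recorded below is the first model's accuracy)
print("Accuracy on training set : ", accuracy_score(y_train, pred_train5))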
print(lg5.summary())
Optimization terminated successfully.
Current function value: inf
Iterations 9
Accuracy on training set : 0.8079779917469051
Logit Regression Results
==============================================================================
Dep. Variable: Not_Canceled No. Observations: 18175
Model: Logit Df Residuals: 18154
Method: MLE Df Model: 20
Date: Fri, 13 May 2022 Pseudo R-squ.: inf
Time: 15:41:17 Log-Likelihood: -inf
converged: True LL-Null: 0.0000
Covariance Type: nonrobust LLR p-value: 1.000
==============================================================================================
coef std err z P>|z| [0.025 0.975]
----------------------------------------------------------------------------------------------
const 4.4248 3.202 1.382 0.167 -1.850 10.700
no_of_children -0.1522 0.060 -2.541 0.011 -0.270 -0.035
no_of_weekend_nights -0.0499 0.023 -2.126 0.033 -0.096 -0.004
no_of_week_nights -0.0994 0.014 -7.157 0.000 -0.127 -0.072
required_car_parking_space 1.4928 0.139 10.704 0.000 1.219 1.766
lead_time -0.0158 0.000 -43.390 0.000 -0.016 -0.015
repeated_guest 2.8337 0.588 4.819 0.000 1.681 3.986
avg_price_per_room -0.0166 0.001 -19.816 0.000 -0.018 -0.015
no_of_special_requests 1.4838 0.033 44.364 0.000 1.418 1.549
date -8.264e-19 2.1e-18 -0.394 0.694 -4.94e-18 3.29e-18
Meal Plan 2 0.0718 0.107 0.671 0.502 -0.138 0.282
Meal Plan 3 -2.6012 1.746 -1.490 0.136 -6.024 0.822
Not Selected -0.2872 0.058 -4.961 0.000 -0.401 -0.174
Room_Type 2 0.2558 0.146 1.757 0.079 -0.030 0.541
Room_Type 4 0.0917 0.057 1.622 0.105 -0.019 0.203
Room_Type 5 0.7568 0.249 3.034 0.002 0.268 1.246
Room_Type 6 0.7633 0.152 5.007 0.000 0.465 1.062
Room_Type 7 1.3184 0.325 4.060 0.000 0.682 1.955
Corporate 0.3442 0.294 1.172 0.241 -0.231 0.920
Offline 2.1004 0.280 7.506 0.000 1.552 2.649
Online -0.2822 0.271 -1.040 0.299 -0.814 0.250
==============================================================================================
X_train6 = X_train5.drop("date", axis=1)
# fitting the model on training set
logit6 = sm.Logit(y_train, X_train6.astype(float))
lg6 = logit6.fit()
pred_train6 = lg6.predict(X_train6)
pred_train6 = np.round(pred_train6)
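# score this model's own predictions (the original cell passed pred_train1, so the value recorded below is the first model's accuracy)
print("Accuracy on training set : ", accuracy_score(y_train, pred_train6))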
print(lg6.summary())
Optimization terminated successfully.
Current function value: inf
Iterations 9
Accuracy on training set : 0.8079779917469051
Logit Regression Results
==============================================================================
Dep. Variable: Not_Canceled No. Observations: 18175
Model: Logit Df Residuals: 18155
Method: MLE Df Model: 19
Date: Fri, 13 May 2022 Pseudo R-squ.: inf
Time: 15:43:53 Log-Likelihood: -inf
converged: True LL-Null: 0.0000
Covariance Type: nonrobust LLR p-value: 1.000
==============================================================================================
coef std err z P>|z| [0.025 0.975]
----------------------------------------------------------------------------------------------
const 3.1685 0.272 11.629 0.000 2.634 3.703
no_of_children -0.1525 0.060 -2.547 0.011 -0.270 -0.035
no_of_weekend_nights -0.0500 0.023 -2.132 0.033 -0.096 -0.004
no_of_week_nights -0.0993 0.014 -7.151 0.000 -0.127 -0.072
required_car_parking_space 1.4955 0.139 10.738 0.000 1.223 1.769
lead_time -0.0158 0.000 -45.639 0.000 -0.016 -0.015
repeated_guest 2.8300 0.588 4.814 0.000 1.678 3.982
avg_price_per_room -0.0167 0.001 -20.739 0.000 -0.018 -0.015
no_of_special_requests 1.4830 0.033 44.437 0.000 1.418 1.548
Meal Plan 2 0.0746 0.107 0.698 0.485 -0.135 0.284
Meal Plan 3 -2.5909 1.754 -1.477 0.140 -6.028 0.847
Not Selected -0.2915 0.057 -5.130 0.000 -0.403 -0.180
Room_Type 2 0.2569 0.145 1.766 0.077 -0.028 0.542
Room_Type 4 0.0909 0.057 1.609 0.108 -0.020 0.202
Room_Type 5 0.7549 0.249 3.028 0.002 0.266 1.244
Room_Type 6 0.7685 0.152 5.060 0.000 0.471 1.066
Room_Type 7 1.3220 0.325 4.072 0.000 0.686 1.958
Corporate 0.3515 0.293 1.199 0.230 -0.223 0.926
Offline 2.1076 0.279 7.547 0.000 1.560 2.655
Online -0.2745 0.271 -1.014 0.311 -0.805 0.256
==============================================================================================
X_train7 = X_train6.drop("Meal Plan 2", axis=1)
# fitting the model on training set
logit7 = sm.Logit(y_train, X_train7.astype(float))
lg7 = logit7.fit()
pred_train7 = lg7.predict(X_train7)
pred_train7 = np.round(pred_train7)
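# score this model's own predictions (the original cell passed pred_train1, so the value recorded below is the first model's accuracy)
print("Accuracy on training set : ", accuracy_score(y_train, pred_train7))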
print(lg7.summary())
Optimization terminated successfully.
Current function value: inf
Iterations 9
Accuracy on training set : 0.8079779917469051
Logit Regression Results
==============================================================================
Dep. Variable: Not_Canceled No. Observations: 18175
Model: Logit Df Residuals: 18156
Method: MLE Df Model: 18
Date: Fri, 13 May 2022 Pseudo R-squ.: inf
Time: 15:52:14 Log-Likelihood: -inf
converged: True LL-Null: 0.0000
Covariance Type: nonrobust LLR p-value: 1.000
==============================================================================================
coef std err z P>|z| [0.025 0.975]
----------------------------------------------------------------------------------------------
const 3.1612 0.272 11.619 0.000 2.628 3.694
no_of_children -0.1523 0.060 -2.544 0.011 -0.270 -0.035
no_of_weekend_nights -0.0496 0.023 -2.115 0.034 -0.096 -0.004
no_of_week_nights -0.0991 0.014 -7.135 0.000 -0.126 -0.072
required_car_parking_space 1.4954 0.139 10.737 0.000 1.222 1.768
lead_time -0.0158 0.000 -45.747 0.000 -0.016 -0.015
repeated_guest 2.8314 0.588 4.816 0.000 1.679 3.984
avg_price_per_room -0.0166 0.001 -21.102 0.000 -0.018 -0.015
no_of_special_requests 1.4828 0.033 44.431 0.000 1.417 1.548
Meal Plan 3 -2.5917 1.742 -1.488 0.137 -6.005 0.822
Not Selected -0.2931 0.057 -5.161 0.000 -0.404 -0.182
Room_Type 2 0.2545 0.145 1.751 0.080 -0.030 0.539
Room_Type 4 0.0875 0.056 1.555 0.120 -0.023 0.198
Room_Type 5 0.7498 0.249 3.008 0.003 0.261 1.238
Room_Type 6 0.7596 0.151 5.019 0.000 0.463 1.056
Room_Type 7 1.3061 0.324 4.035 0.000 0.672 1.940
Corporate 0.3471 0.293 1.185 0.236 -0.227 0.921
Offline 2.1124 0.279 7.569 0.000 1.565 2.659
Online -0.2794 0.271 -1.033 0.302 -0.810 0.251
==============================================================================================
X_train8 = X_train7.drop("Online", axis=1)
# fitting the model on training set
logit8 = sm.Logit(y_train, X_train8.astype(float))
lg8 = logit8.fit()
pred_train8 = lg8.predict(X_train8)
pred_train8 = np.round(pred_train8)
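# score this model's own predictions (the original cell passed pred_train1, so the value recorded below is the first model's accuracy)
print("Accuracy on training set : ", accuracy_score(y_train, pred_train8))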
print(lg8.summary())
Optimization terminated successfully.
Current function value: inf
Iterations 9
Accuracy on training set : 0.8079779917469051
Logit Regression Results
==============================================================================
Dep. Variable: Not_Canceled No. Observations: 18175
Model: Logit Df Residuals: 18157
Method: MLE Df Model: 17
Date: Fri, 13 May 2022 Pseudo R-squ.: inf
Time: 15:54:01 Log-Likelihood: -inf
converged: True LL-Null: 0.0000
Covariance Type: nonrobust LLR p-value: 1.000
==============================================================================================
coef std err z P>|z| [0.025 0.975]
----------------------------------------------------------------------------------------------
const 2.9029 0.102 28.429 0.000 2.703 3.103
no_of_children -0.1514 0.060 -2.529 0.011 -0.269 -0.034
no_of_weekend_nights -0.0500 0.023 -2.130 0.033 -0.096 -0.004
no_of_week_nights -0.0993 0.014 -7.154 0.000 -0.127 -0.072
required_car_parking_space 1.4979 0.139 10.754 0.000 1.225 1.771
lead_time -0.0158 0.000 -46.020 0.000 -0.016 -0.015
repeated_guest 2.8522 0.587 4.855 0.000 1.701 4.003
avg_price_per_room -0.0167 0.001 -21.618 0.000 -0.018 -0.015
no_of_special_requests 1.4820 0.033 44.418 0.000 1.417 1.547
Meal Plan 3 -2.4576 1.673 -1.469 0.142 -5.737 0.822
Not Selected -0.2977 0.057 -5.259 0.000 -0.409 -0.187
Room_Type 2 0.2524 0.145 1.736 0.083 -0.033 0.537
Room_Type 4 0.0917 0.056 1.632 0.103 -0.018 0.202
Room_Type 5 0.7552 0.249 3.032 0.002 0.267 1.243
Room_Type 6 0.7663 0.151 5.068 0.000 0.470 1.063
Room_Type 7 1.3190 0.323 4.078 0.000 0.685 1.953
Corporate 0.6186 0.128 4.849 0.000 0.369 0.869
Offline 2.3882 0.080 29.807 0.000 2.231 2.545
==============================================================================================
X_train9 = X_train8.drop("Meal Plan 3", axis=1)
# fitting the model on training set
logit9 = sm.Logit(y_train, X_train9.astype(float))
lg9 = logit9.fit()
pred_train9 = lg9.predict(X_train9)
pred_train9 = np.round(pred_train9)
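# score this model's own predictions (the original cell passed pred_train1, so the value recorded below is the first model's accuracy)
print("Accuracy on training set : ", accuracy_score(y_train, pred_train9))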
print(lg9.summary())
Optimization terminated successfully.
Current function value: inf
Iterations 9
Accuracy on training set : 0.8079779917469051
Logit Regression Results
==============================================================================
Dep. Variable: Not_Canceled No. Observations: 18175
Model: Logit Df Residuals: 18158
Method: MLE Df Model: 16
Date: Fri, 13 May 2022 Pseudo R-squ.: inf
Time: 15:55:00 Log-Likelihood: -inf
converged: True LL-Null: 0.0000
Covariance Type: nonrobust LLR p-value: 1.000
==============================================================================================
coef std err z P>|z| [0.025 0.975]
----------------------------------------------------------------------------------------------
const 2.9030 0.102 28.396 0.000 2.703 3.103
no_of_children -0.1506 0.060 -2.515 0.012 -0.268 -0.033
no_of_weekend_nights -0.0497 0.023 -2.119 0.034 -0.096 -0.004
no_of_week_nights -0.0992 0.014 -7.149 0.000 -0.126 -0.072
required_car_parking_space 1.4974 0.139 10.750 0.000 1.224 1.770
lead_time -0.0158 0.000 -46.014 0.000 -0.016 -0.015
repeated_guest 2.8515 0.587 4.854 0.000 1.700 4.003
avg_price_per_room -0.0168 0.001 -21.599 0.000 -0.018 -0.015
no_of_special_requests 1.4821 0.033 44.422 0.000 1.417 1.547
Not Selected -0.2975 0.057 -5.255 0.000 -0.408 -0.187
Room_Type 2 0.2518 0.145 1.732 0.083 -0.033 0.537
Room_Type 4 0.0922 0.056 1.641 0.101 -0.018 0.202
Room_Type 5 0.7559 0.249 3.035 0.002 0.268 1.244
Room_Type 6 0.7662 0.151 5.065 0.000 0.470 1.063
Room_Type 7 1.3038 0.324 4.023 0.000 0.669 1.939
Corporate 0.6191 0.128 4.853 0.000 0.369 0.869
Offline 2.3853 0.080 29.794 0.000 2.228 2.542
==============================================================================================
X_train10 = X_train9.drop("Room_Type 4", axis=1)
# fitting the model on training set
logit10 = sm.Logit(y_train, X_train10.astype(float))
lg10 = logit10.fit()
pred_train10 = lg10.predict(X_train10)
pred_train10 = np.round(pred_train10)
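# score this model's own predictions (the original cell passed pred_train1, so the value recorded below is the first model's accuracy)
print("Accuracy on training set : ", accuracy_score(y_train, pred_train10))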
print(lg10.summary())
Optimization terminated successfully.
Current function value: inf
Iterations 9
Accuracy on training set : 0.8079779917469051
Logit Regression Results
==============================================================================
Dep. Variable: Not_Canceled No. Observations: 18175
Model: Logit Df Residuals: 18159
Method: MLE Df Model: 15
Date: Fri, 13 May 2022 Pseudo R-squ.: inf
Time: 15:56:12 Log-Likelihood: -inf
converged: True LL-Null: 0.0000
Covariance Type: nonrobust LLR p-value: 1.000
==============================================================================================
coef std err z P>|z| [0.025 0.975]
----------------------------------------------------------------------------------------------
const 2.8793 0.101 28.510 0.000 2.681 3.077
no_of_children -0.1626 0.059 -2.746 0.006 -0.279 -0.047
no_of_weekend_nights -0.0491 0.023 -2.095 0.036 -0.095 -0.003
no_of_week_nights -0.0970 0.014 -7.025 0.000 -0.124 -0.070
required_car_parking_space 1.4915 0.139 10.710 0.000 1.219 1.764
lead_time -0.0158 0.000 -46.033 0.000 -0.016 -0.015
repeated_guest 2.8628 0.587 4.874 0.000 1.712 4.014
avg_price_per_room -0.0163 0.001 -22.535 0.000 -0.018 -0.015
no_of_special_requests 1.4814 0.033 44.413 0.000 1.416 1.547
Not Selected -0.3203 0.055 -5.834 0.000 -0.428 -0.213
Room_Type 2 0.2365 0.145 1.632 0.103 -0.048 0.521
Room_Type 5 0.7209 0.248 2.906 0.004 0.235 1.207
Room_Type 6 0.7223 0.149 4.859 0.000 0.431 1.014
Room_Type 7 1.2388 0.321 3.857 0.000 0.609 1.868
Corporate 0.6052 0.127 4.757 0.000 0.356 0.855
Offline 2.3740 0.080 29.800 0.000 2.218 2.530
==============================================================================================
X_train11 = X_train10.drop("Room_Type 2", axis=1)
# fitting the model on training set
logit11 = sm.Logit(y_train, X_train11.astype(float))
lg11 = logit11.fit()
pred_train11 = lg11.predict(X_train11)
pred_train11 = np.round(pred_train11)
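# score this model's own predictions (the original cell passed pred_train1, so the value recorded below is the first model's accuracy)
print("Accuracy on training set : ", accuracy_score(y_train, pred_train11))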
print(lg11.summary())
Optimization terminated successfully.
Current function value: inf
Iterations 9
Accuracy on training set : 0.8079779917469051
Logit Regression Results
==============================================================================
Dep. Variable: Not_Canceled No. Observations: 18175
Model: Logit Df Residuals: 18160
Method: MLE Df Model: 14
Date: Fri, 13 May 2022 Pseudo R-squ.: inf
Time: 15:57:40 Log-Likelihood: -inf
converged: True LL-Null: 0.0000
Covariance Type: nonrobust LLR p-value: 1.000
==============================================================================================
coef std err z P>|z| [0.025 0.975]
----------------------------------------------------------------------------------------------
const 2.8982 0.100 28.867 0.000 2.701 3.095
no_of_children -0.1379 0.057 -2.400 0.016 -0.251 -0.025
no_of_weekend_nights -0.0494 0.023 -2.107 0.035 -0.095 -0.003
no_of_week_nights -0.0972 0.014 -7.042 0.000 -0.124 -0.070
required_car_parking_space 1.4972 0.139 10.748 0.000 1.224 1.770
lead_time -0.0158 0.000 -46.018 0.000 -0.016 -0.015
repeated_guest 2.8604 0.587 4.870 0.000 1.709 4.012
avg_price_per_room -0.0164 0.001 -22.822 0.000 -0.018 -0.015
no_of_special_requests 1.4798 0.033 44.400 0.000 1.415 1.545
Not Selected -0.3272 0.055 -5.978 0.000 -0.434 -0.220
Room_Type 5 0.7172 0.248 2.892 0.004 0.231 1.203
Room_Type 6 0.6835 0.147 4.654 0.000 0.396 0.971
Room_Type 7 1.2232 0.321 3.816 0.000 0.595 1.852
Corporate 0.5974 0.127 4.699 0.000 0.348 0.847
Offline 2.3649 0.079 29.775 0.000 2.209 2.521
==============================================================================================
No remaining feature has a p-value greater than 0.05, so we take the features in X_train11 as the final set and lg11 as the final model.
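The sequence above repeats one step by hand: fit, drop the predictor with the largest p-value, refit. A minimal sketch of that loop follows. It is not the notebook's own code: `backward_eliminate` and `get_pvalues` are hypothetical names, and the model fit is passed in as a callable so the loop does not depend on any modeling library. The p-values in the toy call are made up.

```python
def backward_eliminate(columns, get_pvalues, alpha=0.05, keep=("const",)):
    """Drop the least significant column one at a time until all p-values <= alpha.

    columns: iterable of predictor names to start from
    get_pvalues: hypothetical callable that fits a model on the given
        columns and returns a {column: p-value} mapping
    alpha: significance level used as the stopping rule
    keep: columns that are never dropped (for example the intercept)
    """
    cols = list(columns)
    dropped = []
    while True:
        pvalues = get_pvalues(cols)
        candidates = {c: p for c, p in pvalues.items() if c not in keep}
        if not candidates:
            break
        worst = max(candidates, key=candidates.get)
        if candidates[worst] <= alpha:
            break  # every remaining predictor is significant at alpha
        cols.remove(worst)
        dropped.append(worst)
    return cols, dropped


# Toy stand-in for a model fit; these p-values are invented for illustration.
fake_pvalues = {"const": 0.0, "lead_time": 0.0, "Room_Type 3": 0.93, "date": 0.70}
final_cols, dropped = backward_eliminate(
    fake_pvalues, lambda cols: {c: fake_pvalues[c] for c in cols}
)
print(final_cols)  # ['const', 'lead_time']
print(dropped)     # ['Room_Type 3', 'date']
```

With statsmodels, `get_pvalues` could plausibly be `lambda cols: sm.Logit(y_train, X_train1[cols].astype(float)).fit(disp=False).pvalues.to_dict()`, assuming `X_train1` and `y_train` are the objects used in the cells above.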
# defining a function to compute different metrics to check performance of a classification model built using statsmodels
def model_performance_classification_statsmodels(
    model, predictors, target, threshold=0.5
):
    """
    Function to compute different metrics to check classification model performance

    model: classifier
    predictors: independent variables
    target: dependent variable
    threshold: threshold for classifying the observation as class 1
    """

    # checking which probabilities are greater than threshold
    pred_temp = model.predict(predictors) > threshold
    # rounding off the above values to get classes
    pred = np.round(pred_temp)

    acc = accuracy_score(target, pred)  # to compute Accuracy
    recall = recall_score(target, pred)  # to compute Recall
    precision = precision_score(target, pred)  # to compute Precision
    f1 = f1_score(target, pred)  # to compute F1-score

    # creating a dataframe of metrics
    df_perf = pd.DataFrame(
        {"Accuracy": acc, "Recall": recall, "Precision": precision, "F1": f1},
        index=[0],
    )

    return df_perf
# this will help in making the Python code more structured automatically (good coding practice)
#%load_ext nb_black
import warnings
warnings.filterwarnings("ignore")
from statsmodels.tools.sm_exceptions import ConvergenceWarning
warnings.simplefilter("ignore", ConvergenceWarning)
# Libraries to help with reading and manipulating data
import pandas as pd
import numpy as np
# Library to split data
from sklearn.model_selection import train_test_split
# Libraries to help with data visualization
import matplotlib.pyplot as plt
import seaborn as sns
# Removes the limit for the number of displayed columns
pd.set_option("display.max_columns", None)
# Sets the limit for the number of displayed rows
pd.set_option("display.max_rows", 200)
# To build model for prediction
import statsmodels.stats.api as sms
from statsmodels.stats.outliers_influence import variance_inflation_factor
import statsmodels.api as sm
from statsmodels.tools.tools import add_constant
from sklearn.linear_model import LogisticRegression
# To get different metric scores
# (plot_confusion_matrix is left out: it is unused here and was removed in scikit-learn 1.2)
from sklearn.metrics import (
    f1_score,
    accuracy_score,
    recall_score,
    precision_score,
    confusion_matrix,
    roc_auc_score,
    precision_recall_curve,
    roc_curve,
)
# predicting on training set
# default threshold is 0.5, if predicted probability is greater than 0.5 the observation will be classified as 1
pred_train = lg.predict(X_train) > 0.5
pred_train = np.round(pred_train)
cm = confusion_matrix(y_train, pred_train)
plt.figure(figsize=(7, 5))
sns.heatmap(cm, annot=True, fmt="g")
plt.xlabel("Predicted Values")
plt.ylabel("Actual Values")
plt.show()
log_reg_model_train_perf = model_performance_classification_statsmodels(
lg11, X_train11, y_train
)
print("Training performance:")
log_reg_model_train_perf
Training performance:
| | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 0.808913 | 0.909953 | 0.836618 | 0.871746 |
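As a quick consistency check on the table, F1 is the harmonic mean of precision and recall, so it can be recomputed from the two reported values. The snippet below only uses numbers copied from the output above. It assumes, as the model summaries suggest, that class 1 is `Not_Canceled`, so a recall of about 0.91 describes bookings that were not canceled.

```python
# Precision and recall as reported for lg11 at the default 0.5 threshold
precision = 0.836618
recall = 0.909953

# F1 is the harmonic mean of precision and recall
f1 = 2 * precision * recall / (precision + recall)
print(round(f1, 4))  # 0.8717
```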
# converting coefficients to odds
odds = np.exp(lg11.params)
# finding the percentage change
perc_change_odds = (np.exp(lg11.params) - 1) * 100
# removing limit from number of columns to display
pd.set_option("display.max_columns", None)
# adding the odds to a dataframe
pd.DataFrame({"Odds": odds, "Change_odd%": perc_change_odds}, index=X_train11.columns).T
| | const | no_of_children | no_of_weekend_nights | no_of_week_nights | required_car_parking_space | lead_time | repeated_guest | avg_price_per_room | no_of_special_requests | Not Selected | Room_Type 5 | Room_Type 6 | Room_Type 7 | Corporate | Offline |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Odds | 18.140936 | 0.871146 | 0.951800 | 0.907330 | 4.469278 | 0.984342 | 17.468578 | 0.983708 | 4.392193 | 0.720941 | 2.048695 | 1.980791 | 3.398057 | 1.817340 | 10.643323 |
| Change_odd% | 1714.093581 | -12.885392 | -4.820007 | -9.266997 | 346.927831 | -1.565844 | 1646.857828 | -1.629182 | 339.219348 | -27.905858 | 104.869540 | 98.079059 | 239.805687 | 81.734044 | 964.332268 |
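To read the table, recall that exponentiating a logistic regression coefficient gives the factor by which the odds of class 1 change for a one-unit increase in that feature, with the other features held fixed. The sketch below redoes that arithmetic for three coefficients copied from the lg11 summary. Because the summary rounds coefficients to four decimals, the results match the table only approximately. The dependent variable is `Not_Canceled`, so these are odds of a booking not being canceled.

```python
import math

# Coefficients copied from the lg11 summary (rounded to four decimals there)
coefs = {"lead_time": -0.0158, "no_of_special_requests": 1.4798, "Offline": 2.3649}

for name, beta in coefs.items():
    odds_ratio = math.exp(beta)          # multiplicative change in the odds per unit increase
    pct_change = (odds_ratio - 1) * 100  # the same change expressed as a percentage
    print(f"{name}: odds ratio {odds_ratio:.3f}, change {pct_change:+.1f}%")
```

For example, each extra day of lead time multiplies the odds of a booking not being canceled by about 0.984 (a drop of roughly 1.6%), while each additional special request multiplies them by about 4.39.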
# defining a function to plot the confusion_matrix of a classification model
def confusion_matrix_statsmodels(model, predictors, target, threshold=0.5):
    """
    To plot the confusion_matrix with percentages

    model: classifier
    predictors: independent variables
    target: dependent variable
    threshold: threshold for classifying the observation as class 1
    """
    y_pred = model.predict(predictors) > threshold
    cm = confusion_matrix(target, y_pred)
    labels = np.asarray(
        [
            ["{0:0.0f}".format(item) + "\n{0:.2%}".format(item / cm.flatten().sum())]
            for item in cm.flatten()
        ]
    ).reshape(2, 2)

    plt.figure(figsize=(6, 4))
    sns.heatmap(cm, annot=labels, fmt="")
    plt.ylabel("True label")
    plt.xlabel("Predicted label")
# creating confusion matrix
# Check Model on training set
confusion_matrix_statsmodels(lg11, X_train11, y_train)
log_reg_model_train_perf = model_performance_classification_statsmodels(
lg11, X_train11, y_train
)
print("Training performance:")
log_reg_model_train_perf
Training performance:
| | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 0.808913 | 0.909953 | 0.836618 | 0.871746 |
ROC-AUC on training set
logit_roc_auc_train = roc_auc_score(y_train, lg11.predict(X_train11))
fpr, tpr, thresholds = roc_curve(y_train, lg11.predict(X_train11))
plt.figure(figsize=(7, 5))
plt.plot(fpr, tpr, label="Logistic Regression (area = %0.2f)" % logit_roc_auc_train)
plt.plot([0, 1], [0, 1], "r--")
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("Receiver operating characteristic")
plt.legend(loc="lower right")
plt.show()
The logistic regression model performs well on the training set: at the default 0.5 threshold it reaches about 0.81 accuracy, 0.91 recall and 0.87 F1.
# Optimal threshold as per AUC-ROC curve
# The optimal cut off would be where tpr is high and fpr is low
fpr, tpr, thresholds = roc_curve(y_train, lg11.predict(X_train11))
optimal_idx = np.argmax(tpr - fpr)
optimal_threshold_auc_roc = thresholds[optimal_idx]
print(optimal_threshold_auc_roc)
0.7203978395856949
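The rule `np.argmax(tpr - fpr)` picks the threshold that maximizes the gap between the true positive rate and the false positive rate, a quantity often called Youden's J statistic. The pure-Python sketch below spells out the same idea on a small made-up sample. `youden_threshold` is a hypothetical helper, not part of scikit-learn, and the labels and scores are invented.

```python
def youden_threshold(y_true, scores):
    """Return (threshold, J) maximizing J = TPR - FPR over the observed scores."""
    positives = sum(y_true)
    negatives = len(y_true) - positives
    best_t, best_j = None, -1.0
    for t in sorted(set(scores)):
        # an observation is predicted as class 1 when its score is at least t
        tp = sum(1 for y, s in zip(y_true, scores) if s >= t and y == 1)
        fp = sum(1 for y, s in zip(y_true, scores) if s >= t and y == 0)
        j = tp / positives - fp / negatives
        if j > best_j:
            best_t, best_j = t, j
    return best_t, best_j


# Invented labels and predicted probabilities, for illustration only
y_toy = [0, 0, 1, 0, 1, 1, 1, 1]
s_toy = [0.1, 0.3, 0.35, 0.6, 0.65, 0.7, 0.8, 0.9]
best_t, best_j = youden_threshold(y_toy, s_toy)
print(best_t, round(best_j, 2))  # 0.65 0.8
```

One detail to keep in mind: this sketch, like `roc_curve`, counts a score equal to the threshold as class 1, whereas the helper functions in this notebook use a strict `>` comparison. The difference only matters for observations whose score equals the threshold exactly.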
# creating confusion matrix
confusion_matrix_statsmodels(
lg11, X_train11, y_train, threshold=optimal_threshold_auc_roc
)
# checking model performance for this model
log_reg_model_train_perf_threshold_auc_roc = model_performance_classification_statsmodels(
lg11, X_train11, y_train, threshold=optimal_threshold_auc_roc
)
print("Training performance:")
log_reg_model_train_perf_threshold_auc_roc
Training performance:
| | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 0.772545 | 0.76725 | 0.89925 | 0.828022 |
y_scores = lg11.predict(X_train11)
prec, rec, tre = precision_recall_curve(y_train, y_scores)
def plot_prec_recall_vs_tresh(precisions, recalls, thresholds):
    plt.plot(thresholds, precisions[:-1], "b--", label="precision")
    plt.plot(thresholds, recalls[:-1], "g--", label="recall")
    plt.xlabel("Threshold")
    plt.legend(loc="upper left")
    plt.ylim([0, 1])
plt.figure(figsize=(10, 7))
plot_prec_recall_vs_tresh(prec, rec, tre)
plt.show()
# setting the threshold
optimal_threshold_curve = 0.58
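The value 0.58 is set by hand from the precision-recall plot, presumably near the point where the two curves meet. As a hedged illustration of how such a balance point could be located programmatically, the sketch below scans the observed scores and returns the one where precision and recall are closest. The helper names `precision_recall_at` and `balanced_threshold` and the toy data are assumptions for illustration only. The sketch predicts class 1 when the score is greater than or equal to the threshold.

```python
def precision_recall_at(y_true, y_score, thr):
    """Precision and recall when scores >= thr are predicted as class 1."""
    tp = sum(1 for y, s in zip(y_true, y_score) if s >= thr and y == 1)
    fp = sum(1 for y, s in zip(y_true, y_score) if s >= thr and y == 0)
    fn = sum(1 for y, s in zip(y_true, y_score) if s < thr and y == 1)
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def balanced_threshold(y_true, y_score):
    """Return the observed score at which |precision - recall| is smallest."""

    def gap(thr):
        precision, recall = precision_recall_at(y_true, y_score, thr)
        return abs(precision - recall)

    return min(sorted(set(y_score)), key=gap)


# Toy labels and scores, made up for illustration (not the hotel data)
y_true = [0, 0, 1, 0, 1, 0, 1, 1, 1, 1]
y_score = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 0.95]

best_thr = balanced_threshold(y_true, y_score)
print(best_thr, precision_recall_at(y_true, y_score, best_thr))
```

On the toy data the balance point is 0.5, where precision and recall both equal 5/6. On real data the curves rarely cross exactly at an observed score, so the result is the nearest candidate rather than an exact intersection.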
# creating confusion matrix
confusion_matrix_statsmodels(lg11, X_train11, y_train, threshold=optimal_threshold_curve)
log_reg_model_train_perf_threshold_curve = model_performance_classification_statsmodels(
lg11, X_train11, y_train, threshold=optimal_threshold_curve
)
print("Training performance:")
log_reg_model_train_perf_threshold_curve
Training performance:
| | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 0.804402 | 0.869478 | 0.858295 | 0.86385 |
# training performance comparison
models_train_comp_df = pd.concat(
[
log_reg_model_train_perf.T,
log_reg_model_train_perf_threshold_auc_roc.T,
log_reg_model_train_perf_threshold_curve.T,
],
axis=1,
)
models_train_comp_df.columns = [
    "Logistic Regression-default Threshold (0.5)",
    "Logistic Regression-0.72 Threshold",
    "Logistic Regression-0.58 Threshold",
]
print("Training performance comparison:")
models_train_comp_df
Training performance comparison:
| | Logistic Regression-default Threshold (0.5) | Logistic Regression-0.72 Threshold | Logistic Regression-0.58 Threshold |
|---|---|---|---|
| Accuracy | 0.808913 | 0.772545 | 0.804402 |
| Recall | 0.909953 | 0.767250 | 0.869478 |
| Precision | 0.836618 | 0.899250 | 0.858295 |
| F1 | 0.871746 | 0.828022 | 0.863850 |
X_test11 = X_test[list(X_train11.columns)]
# creating confusion matrix
confusion_matrix_statsmodels(lg11, X_test11, y_test)
log_reg_model_test_perf = model_performance_classification_statsmodels(
lg11, X_test11, y_test
)
print("Test performance:")
log_reg_model_test_perf
Test performance:
| | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 0.810526 | 0.915273 | 0.835057 | 0.873326 |
logit_roc_auc_test = roc_auc_score(y_test, lg11.predict(X_test11))
fpr, tpr, thresholds = roc_curve(y_test, lg11.predict(X_test11))
plt.figure(figsize=(7, 5))
plt.plot(fpr, tpr, label="Logistic Regression (area = %0.2f)" % logit_roc_auc_test)
plt.plot([0, 1], [0, 1], "r--")
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("Receiver operating characteristic")
plt.legend(loc="lower right")
plt.show()
log_reg_model_test_perf_threshold_curve = model_performance_classification_statsmodels(
lg11, X_test11, y_test, threshold=optimal_threshold_curve
)
print("Test performance:")
log_reg_model_test_perf_threshold_curve
Test performance:
| | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 0.808858 | 0.873898 | 0.860432 | 0.867113 |
# creating confusion matrix
confusion_matrix_statsmodels(lg11, X_test11, y_test, threshold=optimal_threshold_auc_roc)
# checking model performance for this model
log_reg_model_test_perf_threshold_auc_roc = model_performance_classification_statsmodels(
lg11, X_test11, y_test, threshold=optimal_threshold_auc_roc
)
print("Test performance:")
log_reg_model_test_perf_threshold_auc_roc
Test performance:
| | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 0.775353 | 0.77352 | 0.897516 | 0.830918 |
# testing performance comparison
models_test_comp_df = pd.concat(
[
log_reg_model_test_perf.T,
log_reg_model_test_perf_threshold_auc_roc.T,
log_reg_model_test_perf_threshold_curve.T,
],
axis=1,
)
models_test_comp_df.columns = [
    "Logistic Regression-default Threshold (0.5)",
    "Logistic Regression-0.72 Threshold",
    "Logistic Regression-0.58 Threshold",
]
print("Test set performance comparison:")
models_test_comp_df
Test set performance comparison:
| | Logistic Regression-default Threshold (0.5) | Logistic Regression-0.72 Threshold | Logistic Regression-0.58 Threshold |
|---|---|---|---|
| Accuracy | 0.810526 | 0.775353 | 0.808858 |
| Recall | 0.915273 | 0.773520 | 0.873898 |
| Precision | 0.835057 | 0.897516 | 0.860432 |
| F1 | 0.873326 | 0.830918 | 0.867113 |
We have built a predictive model that the hotels can use to find the bookings most likely to be canceled. On the training set it reaches an F1 score of 0.87 at the default threshold and a precision of 0.90 at the 0.72 threshold.
All three logistic regression thresholds give nearly identical metrics on the training and test sets, so the model generalizes well.
The coefficients of required_car_parking_space, repeated_guest, no_of_special_requests, and some categorical levels (Room Type 5, Room Type 6, Room Type 7, Corporate customers, Offline bookings) are positive: an increase in any of these increases the chance that a guest does not cancel the booking.
The coefficients of number of children, number of weekend nights, number of week nights, lead time, avg_price_per_room, and the Not Selected meal plan are negative: an increase in any of these decreases the chance that a guest does not cancel, i.e., it makes cancellation more likely.
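The sign-based reading above can be made quantitative: in a logistic regression, a one-unit increase in a feature multiplies the odds of the class-1 outcome by exp(coefficient), with the other features held fixed. The sketch below shows that conversion. The helper name `odds_change` and the coefficient values are hypothetical and chosen only to illustrate the sign convention; they are not the fitted values from `lg11`.

```python
import math


def odds_change(coef):
    """Convert a logit coefficient into (odds ratio, percent change in odds)."""
    odds_ratio = math.exp(coef)
    return odds_ratio, (odds_ratio - 1) * 100


# Hypothetical coefficients, chosen only to illustrate the sign convention
example_coefs = {
    "positive_feature": 0.5,
    "negative_feature": -0.5,
    "no_effect": 0.0,
}

for name, coef in example_coefs.items():
    ratio, pct = odds_change(coef)
    print(f"{name}: odds ratio = {ratio:.3f}, change in odds = {pct:+.1f}%")
```

A positive coefficient gives an odds ratio above 1, a negative coefficient gives one below 1, and a zero coefficient leaves the odds unchanged. With a fitted statsmodels result, the same conversion could be applied to its coefficient vector.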
from sklearn import metrics
%matplotlib inline
# Library to suppress warnings or deprecation notes
import warnings
warnings.filterwarnings("ignore")
# Libraries to help with reading and manipulating data
import pandas as pd
import numpy as np
# Library to split data
from sklearn.model_selection import train_test_split
# Libraries to help with data visualization
import matplotlib.pyplot as plt
import seaborn as sns
# Removes the limit for the number of displayed columns
pd.set_option("display.max_columns", None)
# Sets the limit for the number of displayed rows
pd.set_option("display.max_rows", 200)
# Libraries to build decision tree classifier
from sklearn.tree import DecisionTreeClassifier
from sklearn import tree
# To tune different models
from sklearn.model_selection import GridSearchCV
# To perform statistical analysis
import scipy.stats as stats
# To get different metric scores
from sklearn.metrics import (
    f1_score,
    accuracy_score,
    recall_score,
    precision_score,
    confusion_matrix,
    make_scorer,
)
# defining a function to compute different metrics to check performance of a classification model built using sklearn
def model_performance_classification_sklearn(model, predictors, target):
    """
    Function to compute different metrics to check classification model performance
    model: classifier
    predictors: independent variables
    target: dependent variable
    """
    # predicting using the independent variables
    pred = model.predict(predictors)
    acc = accuracy_score(target, pred)  # to compute Accuracy
    recall = recall_score(target, pred)  # to compute Recall
    precision = precision_score(target, pred)  # to compute Precision
    f1 = f1_score(target, pred)  # to compute F1-score
    # creating a dataframe of metrics
    df_perf = pd.DataFrame(
        {"Accuracy": acc, "Recall": recall, "Precision": precision, "F1": f1},
        index=[0],
    )
    return df_perf
def confusion_matrix_sklearn(model, predictors, target):
    """
    To plot the confusion_matrix with percentages
    model: classifier
    predictors: independent variables
    target: dependent variable
    """
    y_pred = model.predict(predictors)
    cm = confusion_matrix(target, y_pred)
    labels = np.asarray(
        [
            ["{0:0.0f}".format(item) + "\n{0:.2%}".format(item / cm.flatten().sum())]
            for item in cm.flatten()
        ]
    ).reshape(2, 2)
    plt.figure(figsize=(6, 4))
    sns.heatmap(cm, annot=labels, fmt="")
    plt.ylabel("True label")
    plt.xlabel("Predicted label")
pdata.drop(['arrival_year_date'], axis = 1, inplace=True)
pdata.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 25965 entries, 0 to 36273
Data columns (total 29 columns):
 #   Column                                Non-Null Count  Dtype
---  ------                                --------------  -----
 0   no_of_adults                          25965 non-null  int64
 1   no_of_children                        25965 non-null  int64
 2   no_of_weekend_nights                  25965 non-null  int64
 3   no_of_week_nights                     25965 non-null  int64
 4   required_car_parking_space            25965 non-null  int64
 5   lead_time                             25965 non-null  int64
 6   repeated_guest                        25965 non-null  int64
 7   no_of_previous_cancellations          25965 non-null  int64
 8   no_of_previous_bookings_not_canceled  25965 non-null  int64
 9   avg_price_per_room                    25965 non-null  float64
 10  no_of_special_requests                25965 non-null  int64
 11  date                                  25965 non-null  float64
 12  type_of_meal_plan_Meal Plan 1         25965 non-null  uint8
 13  type_of_meal_plan_Meal Plan 2         25965 non-null  uint8
 14  type_of_meal_plan_Meal Plan 3         25965 non-null  uint8
 15  type_of_meal_plan_Not Selected        25965 non-null  uint8
 16  room_type_reserved_Room_Type 1        25965 non-null  uint8
 17  room_type_reserved_Room_Type 2        25965 non-null  uint8
 18  room_type_reserved_Room_Type 3        25965 non-null  uint8
 19  room_type_reserved_Room_Type 4        25965 non-null  uint8
 20  room_type_reserved_Room_Type 5        25965 non-null  uint8
 21  room_type_reserved_Room_Type 6        25965 non-null  uint8
 22  room_type_reserved_Room_Type 7        25965 non-null  uint8
 23  market_segment_type_Aviation          25965 non-null  uint8
 24  market_segment_type_Complementary     25965 non-null  uint8
 25  market_segment_type_Corporate         25965 non-null  uint8
 26  market_segment_type_Offline           25965 non-null  uint8
 27  market_segment_type_Online            25965 non-null  uint8
 28  booking_status_Canceled               25965 non-null  uint8
dtypes: float64(2), int64(10), uint8(17)
memory usage: 4.0 MB
X = pdata.drop("booking_status_Canceled" , axis=1)
y = pdata.pop("booking_status_Canceled")
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.30, random_state=1)
model = DecisionTreeClassifier(criterion="gini", random_state=1)
model.fit(X_train, y_train)
DecisionTreeClassifier(random_state=1)
decision_tree_perf_train = model_performance_classification_sklearn(
model, X_train, y_train
)
decision_tree_perf_train
| | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 0.996369 | 0.987438 | 1.0 | 0.993679 |
confusion_matrix_sklearn(model, X_train, y_train)
decision_tree_perf_test = model_performance_classification_sklearn(
model, X_test, y_test
)
decision_tree_perf_test
| | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 0.818485 | 0.685465 | 0.672515 | 0.678928 |
feature_names = list(X.columns)
print(feature_names)
['no_of_adults', 'no_of_children', 'no_of_weekend_nights', 'no_of_week_nights', 'required_car_parking_space', 'lead_time', 'repeated_guest', 'no_of_previous_cancellations', 'no_of_previous_bookings_not_canceled', 'avg_price_per_room', 'no_of_special_requests', 'date', 'type_of_meal_plan_Meal Plan 1', 'type_of_meal_plan_Meal Plan 2', 'type_of_meal_plan_Meal Plan 3', 'type_of_meal_plan_Not Selected', 'room_type_reserved_Room_Type 1', 'room_type_reserved_Room_Type 2', 'room_type_reserved_Room_Type 3', 'room_type_reserved_Room_Type 4', 'room_type_reserved_Room_Type 5', 'room_type_reserved_Room_Type 6', 'room_type_reserved_Room_Type 7', 'market_segment_type_Aviation', 'market_segment_type_Complementary', 'market_segment_type_Corporate', 'market_segment_type_Offline', 'market_segment_type_Online']
plt.figure(figsize=(20, 30))
out = tree.plot_tree(
    model,
    feature_names=feature_names,
    filled=True,
    fontsize=9,
    node_ids=True,
    class_names=True,
)
for o in out:
    arrow = o.arrow_patch
    if arrow is not None:
        arrow.set_edgecolor("black")
        arrow.set_linewidth(1)
plt.show()
# Text report showing the rules of the decision tree
print(tree.export_text(model, feature_names=feature_names, show_weights=True))
|--- lead_time <= 150.50 | |--- no_of_special_requests <= 0.50 | | |--- market_segment_type_Online <= 0.50 | | | |--- lead_time <= 90.50 | | | | |--- no_of_weekend_nights <= 4.50 | | | | | |--- avg_price_per_room <= 201.50 | | | | | | |--- avg_price_per_room <= 78.10 | | | | | | | |--- date <= 1505476815676768256.00 | | | | | | | | |--- market_segment_type_Corporate <= 0.50 | | | | | | | | | |--- no_of_children <= 0.50 | | | | | | | | | | |--- date <= 1504958395944271872.00 | | | | | | | | | | | |--- weights: [37.00, 0.00] class: 0 | | | | | | | | | | |--- date > 1504958395944271872.00 | | | | | | | | | | | |--- truncated branch of depth 3 | | | | | | | | | |--- no_of_children > 0.50 | | | | | | | | | | |--- lead_time <= 35.50 | | | | | | | | | | | |--- weights: [0.00, 1.00] class: 1 | | | | | | | | | | |--- lead_time > 35.50 | | | | | | | | | | | |--- weights: [1.00, 0.00] class: 0 | | | | | | | | |--- market_segment_type_Corporate > 0.50 | | | | | | | | | |--- avg_price_per_room <= 66.00 | | | | | | | | | | |--- lead_time <= 11.50 | | | | | | | | | | | |--- truncated branch of depth 5 | | | | | | | | | | |--- lead_time > 11.50 | | | | | | | | | | | |--- truncated branch of depth 4 | | | | | | | | | |--- avg_price_per_room > 66.00 | | | | | | | | | | |--- no_of_week_nights <= 1.50 | | | | | | | | | | | |--- weights: [2.00, 0.00] class: 0 | | | | | | | | | | |--- no_of_week_nights > 1.50 | | | | | | | | | | | |--- weights: [0.00, 3.00] class: 1 | | | | | | | |--- date > 1505476815676768256.00 | | | | | | | | |--- lead_time <= 17.50 | | | | | | | | | |--- no_of_week_nights <= 0.50 | | | | | | | | | | |--- repeated_guest <= 0.50 | | | | | | | | | | | |--- truncated branch of depth 9 | | | | | | | | | | |--- repeated_guest > 0.50 | | | | | | | | | | | |--- weights: [31.00, 0.00] class: 0 | | | | | | | | | |--- no_of_week_nights > 0.50 | | | | | | | | | | |--- avg_price_per_room <= 74.50 | | | | | | | | | | | |--- truncated branch of depth 4 | | | | | | | | | | |--- 
avg_price_per_room > 74.50 | | | | | | | | | | | |--- truncated branch of depth 5 | | | | | | | | |--- lead_time > 17.50 | | | | | | | | | |--- no_of_adults <= 1.50 | | | | | | | | | | |--- avg_price_per_room <= 68.25 | | | | | | | | | | | |--- truncated branch of depth 7 | | | | | | | | | | |--- avg_price_per_room > 68.25 | | | | | | | | | | | |--- truncated branch of depth 9 | | | | | | | | | |--- no_of_adults > 1.50 | | | | | | | | | | |--- room_type_reserved_Room_Type 5 <= 0.50 | | | | | | | | | | | |--- truncated branch of depth 12 | | | | | | | | | | |--- room_type_reserved_Room_Type 5 > 0.50 | | | | | | | | | | | |--- weights: [0.00, 1.00] class: 1 | | | | | | |--- avg_price_per_room > 78.10 | | | | | | | |--- market_segment_type_Offline <= 0.50 | | | | | | | | |--- date <= 1503575966230773760.00 | | | | | | | | | |--- date <= 1502668800418381824.00 | | | | | | | | | | |--- weights: [2.00, 0.00] class: 0 | | | | | | | | | |--- date > 1502668800418381824.00 | | | | | | | | | | |--- weights: [0.00, 4.00] class: 1 | | | | | | | | |--- date > 1503575966230773760.00 | | | | | | | | | |--- lead_time <= 86.50 | | | | | | | | | | |--- repeated_guest <= 0.50 | | | | | | | | | | | |--- truncated branch of depth 15 | | | | | | | | | | |--- repeated_guest > 0.50 | | | | | | | | | | | |--- weights: [58.00, 0.00] class: 0 | | | | | | | | | |--- lead_time > 86.50 | | | | | | | | | | |--- weights: [0.00, 2.00] class: 1 | | | | | | | |--- market_segment_type_Offline > 0.50 | | | | | | | | |--- no_of_weekend_nights <= 0.50 | | | | | | | | | |--- weights: [326.00, 0.00] class: 0 | | | | | | | | |--- no_of_weekend_nights > 0.50 | | | | | | | | | |--- lead_time <= 61.50 | | | | | | | | | | |--- date <= 1502150380685885440.00 | | | | | | | | | | | |--- weights: [0.00, 1.00] class: 1 | | | | | | | | | | |--- date > 1502150380685885440.00 | | | | | | | | | | | |--- truncated branch of depth 15 | | | | | | | | | |--- lead_time > 61.50 | | | | | | | | | | |--- avg_price_per_room <= 
80.38 | | | | | | | | | | | |--- weights: [0.00, 3.00] class: 1 | | | | | | | | | | |--- avg_price_per_room > 80.38 | | | | | | | | | | | |--- truncated branch of depth 9 | | | | | |--- avg_price_per_room > 201.50 | | | | | | |--- no_of_week_nights <= 1.50 | | | | | | | |--- weights: [0.00, 3.00] class: 1 | | | | | | |--- no_of_week_nights > 1.50 | | | | | | | |--- weights: [2.00, 0.00] class: 0 | | | | |--- no_of_weekend_nights > 4.50 | | | | | |--- weights: [0.00, 4.00] class: 1 | | | |--- lead_time > 90.50 | | | | |--- avg_price_per_room <= 113.53 | | | | | |--- date <= 1510660806843301888.00 | | | | | | |--- date <= 1505260761641910272.00 | | | | | | | |--- avg_price_per_room <= 92.03 | | | | | | | | |--- no_of_weekend_nights <= 0.50 | | | | | | | | | |--- weights: [0.00, 1.00] class: 1 | | | | | | | | |--- no_of_weekend_nights > 0.50 | | | | | | | | | |--- weights: [10.00, 0.00] class: 0 | | | | | | | |--- avg_price_per_room > 92.03 | | | | | | | | |--- lead_time <= 99.50 | | | | | | | | | |--- weights: [0.00, 1.00] class: 1 | | | | | | | | |--- lead_time > 99.50 | | | | | | | | | |--- lead_time <= 113.50 | | | | | | | | | | |--- weights: [2.00, 0.00] class: 0 | | | | | | | | | |--- lead_time > 113.50 | | | | | | | | | | |--- date <= 1501848014988247040.00 | | | | | | | | | | | |--- weights: [0.00, 1.00] class: 1 | | | | | | | | | | |--- date > 1501848014988247040.00 | | | | | | | | | | | |--- truncated branch of depth 4 | | | | | | |--- date > 1505260761641910272.00 | | | | | | | |--- avg_price_per_room <= 108.50 | | | | | | | | |--- lead_time <= 107.50 | | | | | | | | | |--- weights: [0.00, 7.00] class: 1 | | | | | | | | |--- lead_time > 107.50 | | | | | | | | | |--- no_of_week_nights <= 2.50 | | | | | | | | | | |--- no_of_week_nights <= 1.00 | | | | | | | | | | | |--- weights: [1.00, 1.00] class: 0 | | | | | | | | | | |--- no_of_week_nights > 1.00 | | | | | | | | | | | |--- weights: [1.00, 1.00] class: 0 | | | | | | | | | |--- no_of_week_nights > 2.50 | | | 
| | | | | | | |--- weights: [1.00, 0.00] class: 0 | | | | | | | |--- avg_price_per_room > 108.50 | | | | | | | | |--- weights: [2.00, 0.00] class: 0 | | | | | |--- date > 1510660806843301888.00 | | | | | | |--- date <= 1518393603401973760.00 | | | | | | | |--- weights: [33.00, 0.00] class: 0 | | | | | | |--- date > 1518393603401973760.00 | | | | | | | |--- avg_price_per_room <= 102.17 | | | | | | | | |--- avg_price_per_room <= 94.89 | | | | | | | | | |--- avg_price_per_room <= 80.12 | | | | | | | | | | |--- date <= 1519732808564604928.00 | | | | | | | | | | | |--- truncated branch of depth 4 | | | | | | | | | | |--- date > 1519732808564604928.00 | | | | | | | | | | | |--- truncated branch of depth 16 | | | | | | | | | |--- avg_price_per_room > 80.12 | | | | | | | | | | |--- type_of_meal_plan_Not Selected <= 0.50 | | | | | | | | | | | |--- truncated branch of depth 11 | | | | | | | | | | |--- type_of_meal_plan_Not Selected > 0.50 | | | | | | | | | | | |--- weights: [0.00, 1.00] class: 1 | | | | | | | | |--- avg_price_per_room > 94.89 | | | | | | | | | |--- lead_time <= 109.00 | | | | | | | | | | |--- avg_price_per_room <= 100.76 | | | | | | | | | | | |--- weights: [0.00, 8.00] class: 1 | | | | | | | | | | |--- avg_price_per_room > 100.76 | | | | | | | | | | | |--- weights: [1.00, 0.00] class: 0 | | | | | | | | | |--- lead_time > 109.00 | | | | | | | | | | |--- lead_time <= 145.50 | | | | | | | | | | | |--- truncated branch of depth 5 | | | | | | | | | | |--- lead_time > 145.50 | | | | | | | | | | | |--- truncated branch of depth 2 | | | | | | | |--- avg_price_per_room > 102.17 | | | | | | | | |--- no_of_week_nights <= 0.50 | | | | | | | | | |--- weights: [0.00, 1.00] class: 1 | | | | | | | | |--- no_of_week_nights > 0.50 | | | | | | | | | |--- weights: [30.00, 0.00] class: 0 | | | | |--- avg_price_per_room > 113.53 | | | | | |--- room_type_reserved_Room_Type 5 <= 0.50 | | | | | | |--- avg_price_per_room <= 131.00 | | | | | | | |--- no_of_week_nights <= 2.50 | | | | 
| | | | |--- no_of_week_nights <= 1.50 | | | | | | | | | |--- date <= 1528372839655145472.00 | | | | | | | | | | |--- weights: [1.00, 0.00] class: 0 | | | | | | | | | |--- date > 1528372839655145472.00 | | | | | | | | | | |--- lead_time <= 110.50 | | | | | | | | | | | |--- truncated branch of depth 2 | | | | | | | | | | |--- lead_time > 110.50 | | | | | | | | | | | |--- weights: [0.00, 3.00] class: 1 | | | | | | | | |--- no_of_week_nights > 1.50 | | | | | | | | | |--- lead_time <= 108.50 | | | | | | | | | | |--- date <= 1537401548106104832.00 | | | | | | | | | | | |--- weights: [0.00, 2.00] class: 1 | | | | | | | | | | |--- date > 1537401548106104832.00 | | | | | | | | | | | |--- weights: [1.00, 0.00] class: 0 | | | | | | | | | |--- lead_time > 108.50 | | | | | | | | | | |--- weights: [7.00, 0.00] class: 0 | | | | | | | |--- no_of_week_nights > 2.50 | | | | | | | | |--- weights: [0.00, 4.00] class: 1 | | | | | | |--- avg_price_per_room > 131.00 | | | | | | | |--- weights: [6.00, 0.00] class: 0 | | | | | |--- room_type_reserved_Room_Type 5 > 0.50 | | | | | | |--- weights: [0.00, 5.00] class: 1 | | |--- market_segment_type_Online > 0.50 | | | |--- lead_time <= 9.50 | | | | |--- avg_price_per_room <= 202.67 | | | | | |--- lead_time <= 3.50 | | | | | | |--- lead_time <= 2.50 | | | | | | | |--- no_of_week_nights <= 8.50 | | | | | | | | |--- date <= 1503403205466259456.00 | | | | | | | | | |--- avg_price_per_room <= 77.50 | | | | | | | | | | |--- no_of_weekend_nights <= 0.50 | | | | | | | | | | | |--- truncated branch of depth 2 | | | | | | | | | | |--- no_of_weekend_nights > 0.50 | | | | | | | | | | | |--- weights: [0.00, 3.00] class: 1 | | | | | | | | | |--- avg_price_per_room > 77.50 | | | | | | | | | | |--- avg_price_per_room <= 133.50 | | | | | | | | | | | |--- truncated branch of depth 6 | | | | | | | | | | |--- avg_price_per_room > 133.50 | | | | | | | | | | | |--- truncated branch of depth 5 | | | | | | | | |--- date > 1503403205466259456.00 | | | | | | | | | 
|--- date <= 1518048013153468416.00 | | | | | | | | | | |--- date <= 1504785635179757568.00 | | | | | | | | | | | |--- truncated branch of depth 4 | | | | | | | | | | |--- date > 1504785635179757568.00 | | | | | | | | | | | |--- weights: [137.00, 0.00] class: 0 | | | | | | | | | |--- date > 1518048013153468416.00 | | | | | | | | | | |--- date <= 1525996795027521536.00 | | | | | | | | | | | |--- truncated branch of depth 10 | | | | | | | | | | |--- date > 1525996795027521536.00 | | | | | | | | | | | |--- truncated branch of depth 10 | | | | | | | |--- no_of_week_nights > 8.50 | | | | | | | | |--- weights: [0.00, 1.00] class: 1 | | | | | | |--- lead_time > 2.50 | | | | | | | |--- room_type_reserved_Room_Type 4 <= 0.50 | | | | | | | | |--- type_of_meal_plan_Not Selected <= 0.50 | | | | | | | | | |--- no_of_week_nights <= 1.50 | | | | | | | | | | |--- weights: [24.00, 0.00] class: 0 | | | | | | | | | |--- no_of_week_nights > 1.50 | | | | | | | | | | |--- avg_price_per_room <= 145.00 | | | | | | | | | | | |--- truncated branch of depth 3 | | | | | | | | | | |--- avg_price_per_room > 145.00 | | | | | | | | | | | |--- truncated branch of depth 2 | | | | | | | | |--- type_of_meal_plan_Not Selected > 0.50 | | | | | | | | | |--- no_of_adults <= 1.50 | | | | | | | | | | |--- weights: [9.00, 0.00] class: 0 | | | | | | | | | |--- no_of_adults > 1.50 | | | | | | | | | | |--- avg_price_per_room <= 89.80 | | | | | | | | | | | |--- truncated branch of depth 6 | | | | | | | | | | |--- avg_price_per_room > 89.80 | | | | | | | | | | | |--- truncated branch of depth 4 | | | | | | | |--- room_type_reserved_Room_Type 4 > 0.50 | | | | | | | | |--- no_of_weekend_nights <= 1.50 | | | | | | | | | |--- avg_price_per_room <= 53.00 | | | | | | | | | | |--- weights: [1.00, 0.00] class: 0 | | | | | | | | | |--- avg_price_per_room > 53.00 | | | | | | | | | | |--- weights: [0.00, 6.00] class: 1 | | | | | | | | |--- no_of_weekend_nights > 1.50 | | | | | | | | | |--- weights: [4.00, 0.00] class: 0 | 
| | | | |--- lead_time > 3.50 | | | | | | |--- date <= 1517659198354096128.00 | | | | | | | |--- date <= 1503835176097021952.00 | | | | | | | | |--- lead_time <= 5.50 | | | | | | | | | |--- type_of_meal_plan_Not Selected <= 0.50 | | | | | | | | | | |--- avg_price_per_room <= 102.94 | | | | | | | | | | | |--- weights: [0.00, 4.00] class: 1 | | | | | | | | | | |--- avg_price_per_room > 102.94 | | | | | | | | | | | |--- weights: [1.00, 0.00] class: 0 | | | | | | | | | |--- type_of_meal_plan_Not Selected > 0.50 | | | | | | | | | | |--- weights: [1.00, 0.00] class: 0 | | | | | | | | |--- lead_time > 5.50 | | | | | | | | | |--- avg_price_per_room <= 157.50 | | | | | | | | | | |--- weights: [8.00, 0.00] class: 0 | | | | | | | | | |--- avg_price_per_room > 157.50 | | | | | | | | | | |--- required_car_parking_space <= 0.50 | | | | | | | | | | | |--- weights: [0.00, 1.00] class: 1 | | | | | | | | | | |--- required_car_parking_space > 0.50 | | | | | | | | | | | |--- weights: [1.00, 0.00] class: 0 | | | | | | | |--- date > 1503835176097021952.00 | | | | | | | | |--- date <= 1510531201910177792.00 | | | | | | | | | |--- date <= 1510315216594796544.00 | | | | | | | | | | |--- date <= 1505433591125901312.00 | | | | | | | | | | | |--- truncated branch of depth 2 | | | | | | | | | | |--- date > 1505433591125901312.00 | | | | | | | | | | | |--- weights: [26.00, 0.00] class: 0 | | | | | | | | | |--- date > 1510315216594796544.00 | | | | | | | | | | |--- weights: [0.00, 1.00] class: 1 | | | | | | | | |--- date > 1510531201910177792.00 | | | | | | | | | |--- weights: [62.00, 0.00] class: 0 | | | | | | |--- date > 1517659198354096128.00 | | | | | | | |--- date <= 1537272011892457472.00 | | | | | | | | |--- avg_price_per_room <= 100.00 | | | | | | | | | |--- no_of_weekend_nights <= 1.50 | | | | | | | | | | |--- lead_time <= 8.50 | | | | | | | | | | | |--- truncated branch of depth 9 | | | | | | | | | | |--- lead_time > 8.50 | | | | | | | | | | | |--- truncated branch of depth 5 | | | | | 
| | | | |--- no_of_weekend_nights > 1.50 | | | | | | | | | | |--- date <= 1527897575754039296.00 | | | | | | | | | | | |--- weights: [0.00, 5.00] class: 1 | | | | | | | | | | |--- date > 1527897575754039296.00 | | | | | | | | | | | |--- weights: [1.00, 0.00] class: 0 | | | | | | | | |--- avg_price_per_room > 100.00 | | | | | | | | | |--- no_of_week_nights <= 3.50 | | | | | | | | | | |--- no_of_adults <= 2.50 | | | | | | | | | | | |--- truncated branch of depth 16 | | | | | | | | | | |--- no_of_adults > 2.50 | | | | | | | | | | | |--- truncated branch of depth 5 | | | | | | | | | |--- no_of_week_nights > 3.50 | | | | | | | | | | |--- avg_price_per_room <= 106.42 | | | | | | | | | | | |--- truncated branch of depth 2 | | | | | | | | | | |--- avg_price_per_room > 106.42 | | | | | | | | | | | |--- weights: [0.00, 14.00] class: 1 | | | | | | | |--- date > 1537272011892457472.00 | | | | | | | | |--- no_of_weekend_nights <= 1.50 | | | | | | | | | |--- lead_time <= 7.50 | | | | | | | | | | |--- weights: [29.00, 0.00] class: 0 | | | | | | | | | |--- lead_time > 7.50 | | | | | | | | | | |--- room_type_reserved_Room_Type 4 <= 0.50 | | | | | | | | | | | |--- truncated branch of depth 3 | | | | | | | | | | |--- room_type_reserved_Room_Type 4 > 0.50 | | | | | | | | | | | |--- weights: [0.00, 1.00] class: 1 | | | | | | | | |--- no_of_weekend_nights > 1.50 | | | | | | | | | |--- date <= 1542715212925239296.00 | | | | | | | | | | |--- lead_time <= 4.50 | | | | | | | | | | | |--- weights: [1.00, 0.00] class: 0 | | | | | | | | | | |--- lead_time > 4.50 | | | | | | | | | | | |--- weights: [0.00, 5.00] class: 1 | | | | | | | | | |--- date > 1542715212925239296.00 | | | | | | | | | | |--- avg_price_per_room <= 67.10 | | | | | | | | | | | |--- weights: [0.00, 1.00] class: 1 | | | | | | | | | | |--- avg_price_per_room > 67.10 | | | | | | | | | | | |--- weights: [7.00, 0.00] class: 0 | | | | |--- avg_price_per_room > 202.67 | | | | | |--- date <= 1543881588603879424.00 | | | | | | |--- 
weights: [0.00, 21.00] class: 1 | | | | | |--- date > 1543881588603879424.00 | | | | | | |--- weights: [1.00, 0.00] class: 0 | | | |--- lead_time > 9.50 | | | | |--- date <= 1517399988487847936.00 | | | | | |--- lead_time <= 76.50 | | | | | | |--- date <= 1499644799844614144.00 | | | | | | | |--- no_of_week_nights <= 3.50 | | | | | | | | |--- weights: [0.00, 3.00] class: 1 | | | | | | | |--- no_of_week_nights > 3.50 | | | | | | | | |--- lead_time <= 57.50 | | | | | | | | | |--- weights: [1.00, 0.00] class: 0 | | | | | | | | |--- lead_time > 57.50 | | | | | | | | | |--- no_of_weekend_nights <= 1.50 | | | | | | | | | | |--- weights: [1.00, 0.00] class: 0 | | | | | | | | | |--- no_of_weekend_nights > 1.50 | | | | | | | | | | |--- weights: [0.00, 1.00] class: 1 | | | | | | |--- date > 1499644799844614144.00 | | | | | | | |--- lead_time <= 24.50 | | | | | | | | |--- no_of_adults <= 2.50 | | | | | | | | | |--- avg_price_per_room <= 104.75 | | | | | | | | | | |--- weights: [96.00, 0.00] class: 0 | | | | | | | | | |--- avg_price_per_room > 104.75 | | | | | | | | | | |--- date <= 1507852791584915456.00 | | | | | | | | | | | |--- truncated branch of depth 8 | | | | | | | | | | |--- date > 1507852791584915456.00 | | | | | | | | | | | |--- weights: [16.00, 0.00] class: 0 | | | | | | | | |--- no_of_adults > 2.50 | | | | | | | | | |--- weights: [0.00, 1.00] class: 1 | | | | | | | |--- lead_time > 24.50 | | | | | | | | |--- date <= 1514635197780328448.00 | | | | | | | | | |--- avg_price_per_room <= 204.58 | | | | | | | | | | |--- lead_time <= 62.50 | | | | | | | | | | | |--- truncated branch of depth 6 | | | | | | | | | | |--- lead_time > 62.50 | | | | | | | | | | | |--- truncated branch of depth 6 | | | | | | | | | |--- avg_price_per_room > 204.58 | | | | | | | | | | |--- weights: [0.00, 1.00] class: 1 | | | | | | | | |--- date > 1514635197780328448.00 | | | | | | | | | |--- avg_price_per_room <= 77.15 | | | | | | | | | | |--- no_of_weekend_nights <= 1.50 | | | | | | | | | | | 
|--- truncated branch of depth 5 | | | | | | | | | | |--- no_of_weekend_nights > 1.50 | | | | | | | | | | | |--- weights: [5.00, 0.00] class: 0 | | | | | | | | | |--- avg_price_per_room > 77.15 | | | | | | | | | | |--- lead_time <= 44.00 | | | | | | | | | | | |--- weights: [0.00, 7.00] class: 1 | | | | | | | | | | |--- lead_time > 44.00 | | | | | | | | | | | |--- weights: [3.00, 0.00] class: 0 | | | | | |--- lead_time > 76.50 | | | | | | |--- date <= 1501199990322626560.00 | | | | | | | |--- weights: [0.00, 19.00] class: 1 | | | | | | |--- date > 1501199990322626560.00 | | | | | | | |--- type_of_meal_plan_Meal Plan 2 <= 0.50 | | | | | | | | |--- date <= 1513080007302316032.00 | | | | | | | | | |--- lead_time <= 79.50 | | | | | | | | | | |--- weights: [0.00, 2.00] class: 1 | | | | | | | | | |--- lead_time > 79.50 | | | | | | | | | | |--- lead_time <= 84.00 | | | | | | | | | | | |--- weights: [2.00, 0.00] class: 0 | | | | | | | | | | |--- lead_time > 84.00 | | | | | | | | | | | |--- truncated branch of depth 10 | | | | | | | | |--- date > 1513080007302316032.00 | | | | | | | | | |--- weights: [5.00, 0.00] class: 0 | | | | | | | |--- type_of_meal_plan_Meal Plan 2 > 0.50 | | | | | | | | |--- weights: [0.00, 8.00] class: 1 | | | | |--- date > 1517399988487847936.00 | | | | | |--- required_car_parking_space <= 0.50 | | | | | | |--- avg_price_per_room <= 104.97 | | | | | | | |--- avg_price_per_room <= 56.57 | | | | | | | | |--- avg_price_per_room <= 54.08 | | | | | | | | | |--- weights: [28.00, 0.00] class: 0 | | | | | | | | |--- avg_price_per_room > 54.08 | | | | | | | | | |--- avg_price_per_room <= 54.17 | | | | | | | | | | |--- lead_time <= 69.50 | | | | | | | | | | | |--- weights: [0.00, 1.00] class: 1 | | | | | | | | | | |--- lead_time > 69.50 | | | | | | | | | | | |--- weights: [1.00, 0.00] class: 0 | | | | | | | | | |--- avg_price_per_room > 54.17 | | | | | | | | | | |--- weights: [7.00, 0.00] class: 0 | | | | | | | |--- avg_price_per_room > 56.57 | | | | | | | | 
|--- lead_time <= 19.50 | | | | | | | | | |--- date <= 1543449617973116928.00 | | | | | | | | | | |--- no_of_week_nights <= 4.50 | | | | | | | | | | | |--- truncated branch of depth 17 | | | | | | | | | | |--- no_of_week_nights > 4.50 | | | | | | | | | | | |--- truncated branch of depth 2 | | | | | | | | | |--- date > 1543449617973116928.00 | | | | | | | | | | |--- weights: [41.00, 0.00] class: 0 | | | | | | | | |--- lead_time > 19.50 | | | | | | | | | |--- type_of_meal_plan_Not Selected <= 0.50 | | | | | | | | | | |--- date <= 1528113629788897280.00 | | | | | | | | | | | |--- truncated branch of depth 20 | | | | | | | | | | |--- date > 1528113629788897280.00 | | | | | | | | | | | |--- truncated branch of depth 14 | | | | | | | | | |--- type_of_meal_plan_Not Selected > 0.50 | | | | | | | | | | |--- no_of_adults <= 1.50 | | | | | | | | | | | |--- truncated branch of depth 8 | | | | | | | | | | |--- no_of_adults > 1.50 | | | | | | | | | | | |--- truncated branch of depth 21 | | | | | | |--- avg_price_per_room > 104.97 | | | | | | | |--- date <= 1540296012466225152.00 | | | | | | | | |--- avg_price_per_room <= 195.12 | | | | | | | | | |--- room_type_reserved_Room_Type 5 <= 0.50 | | | | | | | | | | |--- lead_time <= 135.50 | | | | | | | | | | | |--- truncated branch of depth 34 | | | | | | | | | | |--- lead_time > 135.50 | | | | | | | | | | | |--- truncated branch of depth 7 | | | | | | | | | |--- room_type_reserved_Room_Type 5 > 0.50 | | | | | | | | | | |--- lead_time <= 76.50 | | | | | | | | | | | |--- truncated branch of depth 4 | | | | | | | | | | |--- lead_time > 76.50 | | | | | | | | | | | |--- weights: [0.00, 3.00] class: 1 | | | | | | | | |--- avg_price_per_room > 195.12 | | | | | | | | | |--- no_of_adults <= 1.50 | | | | | | | | | | |--- lead_time <= 59.50 | | | | | | | | | | | |--- weights: [0.00, 6.00] class: 1 | | | | | | | | | | |--- lead_time > 59.50 | | | | | | | | | | | |--- weights: [1.00, 0.00] class: 0 | | | | | | | | | |--- no_of_adults > 1.50 | | | 
| | | | | | | |--- weights: [0.00, 79.00] class: 1 | | | | | | | |--- date > 1540296012466225152.00 | | | | | | | | |--- lead_time <= 46.50 | | | | | | | | | |--- avg_price_per_room <= 146.90 | | | | | | | | | | |--- room_type_reserved_Room_Type 1 <= 0.50 | | | | | | | | | | | |--- truncated branch of depth 3 | | | | | | | | | | |--- room_type_reserved_Room_Type 1 > 0.50 | | | | | | | | | | | |--- truncated branch of depth 6 | | | | | | | | | |--- avg_price_per_room > 146.90 | | | | | | | | | | |--- no_of_weekend_nights <= 0.50 | | | | | | | | | | | |--- truncated branch of depth 3 | | | | | | | | | | |--- no_of_weekend_nights > 0.50 | | | | | | | | | | | |--- truncated branch of depth 5 | | | | | | | | |--- lead_time > 46.50 | | | | | | | | | |--- lead_time <= 56.50 | | | | | | | | | | |--- weights: [0.00, 8.00] class: 1 | | | | | | | | | |--- lead_time > 56.50 | | | | | | | | | | |--- type_of_meal_plan_Not Selected <= 0.50 | | | | | | | | | | | |--- truncated branch of depth 8 | | | | | | | | | | |--- type_of_meal_plan_Not Selected > 0.50 | | | | | | | | | | | |--- weights: [3.00, 0.00] class: 0 | | | | | |--- required_car_parking_space > 0.50 | | | | | | |--- weights: [56.00, 0.00] class: 0 | |--- no_of_special_requests > 0.50 | | |--- date <= 1501070385389502464.00 | | | |--- avg_price_per_room <= 61.33 | | | | |--- no_of_weekend_nights <= 2.50 | | | | | |--- weights: [3.00, 0.00] class: 0 | | | | |--- no_of_weekend_nights > 2.50 | | | | | |--- weights: [0.00, 1.00] class: 1 | | | |--- avg_price_per_room > 61.33 | | | | |--- avg_price_per_room <= 93.80 | | | | | |--- no_of_special_requests <= 1.50 | | | | | | |--- weights: [0.00, 31.00] class: 1 | | | | | |--- no_of_special_requests > 1.50 | | | | | | |--- lead_time <= 84.00 | | | | | | | |--- lead_time <= 65.50 | | | | | | | | |--- weights: [0.00, 2.00] class: 1 | | | | | | | |--- lead_time > 65.50 | | | | | | | | |--- weights: [1.00, 0.00] class: 0 | | | | | | |--- lead_time > 84.00 | | | | | | | |--- 
weights: [0.00, 8.00] class: 1 | | | | |--- avg_price_per_room > 93.80 | | | | | |--- lead_time <= 91.00 | | | | | | |--- weights: [1.00, 0.00] class: 0 | | | | | |--- lead_time > 91.00 | | | | | | |--- weights: [0.00, 1.00] class: 1 | | |--- date > 1501070385389502464.00 | | | |--- no_of_special_requests <= 1.50 | | | | |--- lead_time <= 9.50 | | | | | |--- lead_time <= 4.50 | | | | | | |--- no_of_weekend_nights <= 3.50 | | | | | | | |--- avg_price_per_room <= 80.90 | | | | | | | | |--- weights: [257.00, 0.00] class: 0 | | | | | | | |--- avg_price_per_room > 80.90 | | | | | | | | |--- avg_price_per_room <= 81.15 | | | | | | | | | |--- date <= 1518004857322078208.00 | | | | | | | | | | |--- weights: [5.00, 0.00] class: 0 | | | | | | | | | |--- date > 1518004857322078208.00 | | | | | | | | | | |--- weights: [0.00, 2.00] class: 1 | | | | | | | | |--- avg_price_per_room > 81.15 | | | | | | | | | |--- room_type_reserved_Room_Type 2 <= 0.50 | | | | | | | | | | |--- no_of_week_nights <= 4.50 | | | | | | | | | | | |--- truncated branch of depth 11 | | | | | | | | | | |--- no_of_week_nights > 4.50 | | | | | | | | | | | |--- truncated branch of depth 4 | | | | | | | | | |--- room_type_reserved_Room_Type 2 > 0.50 | | | | | | | | | | |--- lead_time <= 2.50 | | | | | | | | | | | |--- weights: [0.00, 1.00] class: 1 | | | | | | | | | | |--- lead_time > 2.50 | | | | | | | | | | | |--- weights: [2.00, 0.00] class: 0 | | | | | | |--- no_of_weekend_nights > 3.50 | | | | | | | |--- type_of_meal_plan_Not Selected <= 0.50 | | | | | | | | |--- weights: [0.00, 2.00] class: 1 | | | | | | | |--- type_of_meal_plan_Not Selected > 0.50 | | | | | | | | |--- weights: [1.00, 0.00] class: 0 | | | | | |--- lead_time > 4.50 | | | | | | |--- date <= 1502971166116020224.00 | | | | | | | |--- date <= 1502409590552133632.00 | | | | | | | | |--- weights: [2.00, 0.00] class: 0 | | | | | | | |--- date > 1502409590552133632.00 | | | | | | | | |--- no_of_weekend_nights <= 1.50 | | | | | | | | | |--- 
avg_price_per_room <= 94.00 | | | | | | | | | | |--- no_of_week_nights <= 3.00 | | | | | | | | | | | |--- weights: [1.00, 0.00] class: 0 | | | | | | | | | | |--- no_of_week_nights > 3.00 | | | | | | | | | | | |--- weights: [0.00, 1.00] class: 1 | | | | | | | | | |--- avg_price_per_room > 94.00 | | | | | | | | | | |--- weights: [0.00, 4.00] class: 1 | | | | | | | | |--- no_of_weekend_nights > 1.50 | | | | | | | | | |--- avg_price_per_room <= 79.85 | | | | | | | | | | |--- weights: [0.00, 1.00] class: 1 | | | | | | | | | |--- avg_price_per_room > 79.85 | | | | | | | | | | |--- weights: [2.00, 0.00] class: 0 | | | | | | |--- date > 1502971166116020224.00 | | | | | | | |--- avg_price_per_room <= 138.79 | | | | | | | | |--- avg_price_per_room <= 108.45 | | | | | | | | | |--- no_of_week_nights <= 10.50 | | | | | | | | | | |--- date <= 1517745578736353280.00 | | | | | | | | | | | |--- weights: [74.00, 0.00] class: 0 | | | | | | | | | | |--- date > 1517745578736353280.00 | | | | | | | | | | | |--- truncated branch of depth 10 | | | | | | | | | |--- no_of_week_nights > 10.50 | | | | | | | | | | |--- lead_time <= 5.50 | | | | | | | | | | | |--- weights: [0.00, 1.00] class: 1 | | | | | | | | | | |--- lead_time > 5.50 | | | | | | | | | | | |--- weights: [2.00, 0.00] class: 0 | | | | | | | | |--- avg_price_per_room > 108.45 | | | | | | | | | |--- date <= 1505044776326529024.00 | | | | | | | | | | |--- no_of_week_nights <= 3.50 | | | | | | | | | | | |--- weights: [0.00, 3.00] class: 1 | | | | | | | | | | |--- no_of_week_nights > 3.50 | | | | | | | | | | | |--- weights: [2.00, 0.00] class: 0 | | | | | | | | | |--- date > 1505044776326529024.00 | | | | | | | | | | |--- avg_price_per_room <= 109.17 | | | | | | | | | | | |--- truncated branch of depth 5 | | | | | | | | | | |--- avg_price_per_room > 109.17 | | | | | | | | | | | |--- truncated branch of depth 8 | | | | | | | |--- avg_price_per_room > 138.79 | | | | | | | | |--- date <= 1529798425200033792.00 | | | | | | | | | |--- 
avg_price_per_room <= 192.00 | | | | | | | | | | |--- weights: [24.00, 0.00] class: 0 | | | | | | | | | |--- avg_price_per_room > 192.00 | | | | | | | | | | |--- no_of_week_nights <= 2.50 | | | | | | | | | | | |--- weights: [3.00, 0.00] class: 0 | | | | | | | | | | |--- no_of_week_nights > 2.50 | | | | | | | | | | | |--- weights: [0.00, 1.00] class: 1 | | | | | | | | |--- date > 1529798425200033792.00 | | | | | | | | | |--- date <= 1530662366461558784.00 | | | | | | | | | | |--- date <= 1530403156595310592.00 | | | | | | | | | | | |--- truncated branch of depth 3 | | | | | | | | | | |--- date > 1530403156595310592.00 | | | | | | | | | | | |--- weights: [0.00, 2.00] class: 1 | | | | | | | | | |--- date > 1530662366461558784.00 | | | | | | | | | | |--- avg_price_per_room <= 159.50 | | | | | | | | | | | |--- truncated branch of depth 8 | | | | | | | | | | |--- avg_price_per_room > 159.50 | | | | | | | | | | | |--- truncated branch of depth 4 | | | | |--- lead_time > 9.50 | | | | | |--- market_segment_type_Online <= 0.50 | | | | | | |--- type_of_meal_plan_Not Selected <= 0.50 | | | | | | | |--- lead_time <= 102.50 | | | | | | | | |--- avg_price_per_room <= 166.22 | | | | | | | | | |--- lead_time <= 94.00 | | | | | | | | | | |--- weights: [355.00, 0.00] class: 0 | | | | | | | | | |--- lead_time > 94.00 | | | | | | | | | | |--- no_of_children <= 0.50 | | | | | | | | | | | |--- weights: [16.00, 0.00] class: 0 | | | | | | | | | | |--- no_of_children > 0.50 | | | | | | | | | | | |--- truncated branch of depth 2 | | | | | | | | |--- avg_price_per_room > 166.22 | | | | | | | | | |--- no_of_adults <= 2.50 | | | | | | | | | | |--- weights: [0.00, 1.00] class: 1 | | | | | | | | | |--- no_of_adults > 2.50 | | | | | | | | | | |--- weights: [1.00, 0.00] class: 0 | | | | | | | |--- lead_time > 102.50 | | | | | | | | |--- lead_time <= 108.00 | | | | | | | | | |--- date <= 1530705659731902464.00 | | | | | | | | | | |--- lead_time <= 105.00 | | | | | | | | | | | |--- weights: [0.00, 
2.00] class: 1 | | | | | | | | | | |--- lead_time > 105.00 | | | | | | | | | | | |--- truncated branch of depth 3 | | | | | | | | | |--- date > 1530705659731902464.00 | | | | | | | | | | |--- weights: [5.00, 0.00] class: 0 | | | | | | | | |--- lead_time > 108.00 | | | | | | | | | |--- date <= 1522022404090494976.00 | | | | | | | | | | |--- weights: [20.00, 0.00] class: 0 | | | | | | | | | |--- date > 1522022404090494976.00 | | | | | | | | | | |--- type_of_meal_plan_Meal Plan 2 <= 0.50 | | | | | | | | | | | |--- truncated branch of depth 5 | | | | | | | | | | |--- type_of_meal_plan_Meal Plan 2 > 0.50 | | | | | | | | | | | |--- truncated branch of depth 3 | | | | | | |--- type_of_meal_plan_Not Selected > 0.50 | | | | | | | |--- lead_time <= 60.50 | | | | | | | | |--- avg_price_per_room <= 97.95 | | | | | | | | | |--- weights: [8.00, 0.00] class: 0 | | | | | | | | |--- avg_price_per_room > 97.95 | | | | | | | | | |--- lead_time <= 12.50 | | | | | | | | | | |--- weights: [1.00, 0.00] class: 0 | | | | | | | | | |--- lead_time > 12.50 | | | | | | | | | | |--- weights: [1.00, 1.00] class: 0 | | | | | | | |--- lead_time > 60.50 | | | | | | | | |--- weights: [0.00, 3.00] class: 1 | | | | | |--- market_segment_type_Online > 0.50 | | | | | | |--- no_of_week_nights <= 9.50 | | | | | | | |--- date <= 1534679981949452288.00 | | | | | | | | |--- required_car_parking_space <= 0.50 | | | | | | | | | |--- avg_price_per_room <= 137.25 | | | | | | | | | | |--- lead_time <= 128.50 | | | | | | | | | | | |--- truncated branch of depth 24 | | | | | | | | | | |--- lead_time > 128.50 | | | | | | | | | | | |--- truncated branch of depth 11 | | | | | | | | | |--- avg_price_per_room > 137.25 | | | | | | | | | | |--- date <= 1532519991356686336.00 | | | | | | | | | | | |--- truncated branch of depth 14 | | | | | | | | | | |--- date > 1532519991356686336.00 | | | | | | | | | | | |--- truncated branch of depth 10 | | | | | | | | |--- required_car_parking_space > 0.50 | | | | | | | | | |--- 
weights: [131.00, 0.00] class: 0 | | | | | | | |--- date > 1534679981949452288.00 | | | | | | | | |--- date <= 1543579222906241024.00 | | | | | | | | | |--- avg_price_per_room <= 118.69 | | | | | | | | | | |--- date <= 1538567992504221696.00 | | | | | | | | | | | |--- truncated branch of depth 14 | | | | | | | | | | |--- date > 1538567992504221696.00 | | | | | | | | | | | |--- truncated branch of depth 19 | | | | | | | | | |--- avg_price_per_room > 118.69 | | | | | | | | | | |--- required_car_parking_space <= 0.50 | | | | | | | | | | | |--- truncated branch of depth 23 | | | | | | | | | | |--- required_car_parking_space > 0.50 | | | | | | | | | | | |--- weights: [21.00, 0.00] class: 0 | | | | | | | | |--- date > 1543579222906241024.00 | | | | | | | | | |--- lead_time <= 100.00 | | | | | | | | | | |--- no_of_week_nights <= 8.50 | | | | | | | | | | | |--- weights: [228.00, 0.00] class: 0 | | | | | | | | | | |--- no_of_week_nights > 8.50 | | | | | | | | | | | |--- truncated branch of depth 2 | | | | | | | | | |--- lead_time > 100.00 | | | | | | | | | | |--- lead_time <= 136.50 | | | | | | | | | | | |--- truncated branch of depth 7 | | | | | | | | | | |--- lead_time > 136.50 | | | | | | | | | | | |--- truncated branch of depth 2 | | | | | | |--- no_of_week_nights > 9.50 | | | | | | | |--- weights: [0.00, 14.00] class: 1 | | | |--- no_of_special_requests > 1.50 | | | | |--- lead_time <= 89.50 | | | | | |--- no_of_week_nights <= 3.50 | | | | | | |--- weights: [1918.00, 0.00] class: 0 | | | | | |--- no_of_week_nights > 3.50 | | | | | | |--- no_of_week_nights <= 9.50 | | | | | | | |--- no_of_special_requests <= 2.50 | | | | | | | | |--- no_of_weekend_nights <= 0.50 | | | | | | | | | |--- type_of_meal_plan_Meal Plan 2 <= 0.50 | | | | | | | | | | |--- avg_price_per_room <= 124.45 | | | | | | | | | | | |--- truncated branch of depth 6 | | | | | | | | | | |--- avg_price_per_room > 124.45 | | | | | | | | | | | |--- truncated branch of depth 7 | | | | | | | | | |--- 
type_of_meal_plan_Meal Plan 2 > 0.50 | | | | | | | | | | |--- avg_price_per_room <= 129.71 | | | | | | | | | | | |--- weights: [2.00, 0.00] class: 0 | | | | | | | | | | |--- avg_price_per_room > 129.71 | | | | | | | | | | | |--- weights: [0.00, 2.00] class: 1 | | | | | | | | |--- no_of_weekend_nights > 0.50 | | | | | | | | | |--- lead_time <= 8.50 | | | | | | | | | | |--- no_of_children <= 0.50 | | | | | | | | | | | |--- weights: [20.00, 0.00] class: 0 | | | | | | | | | | |--- no_of_children > 0.50 | | | | | | | | | | | |--- truncated branch of depth 2 | | | | | | | | | |--- lead_time > 8.50 | | | | | | | | | | |--- date <= 1504051161412403200.00 | | | | | | | | | | | |--- weights: [0.00, 2.00] class: 1 | | | | | | | | | | |--- date > 1504051161412403200.00 | | | | | | | | | | | |--- truncated branch of depth 13 | | | | | | | |--- no_of_special_requests > 2.50 | | | | | | | | |--- weights: [64.00, 0.00] class: 0 | | | | | | |--- no_of_week_nights > 9.50 | | | | | | | |--- no_of_special_requests <= 2.50 | | | | | | | | |--- weights: [0.00, 4.00] class: 1 | | | | | | | |--- no_of_special_requests > 2.50 | | | | | | | | |--- weights: [1.00, 0.00] class: 0 | | | | |--- lead_time > 89.50 | | | | | |--- avg_price_per_room <= 202.14 | | | | | | |--- date <= 1536494382293712896.00 | | | | | | | |--- no_of_week_nights <= 6.50 | | | | | | | | |--- date <= 1516233612809207808.00 | | | | | | | | | |--- no_of_week_nights <= 3.50 | | | | | | | | | | |--- lead_time <= 105.50 | | | | | | | | | | | |--- weights: [18.00, 0.00] class: 0 | | | | | | | | | | |--- lead_time > 105.50 | | | | | | | | | | | |--- truncated branch of depth 6 | | | | | | | | | |--- no_of_week_nights > 3.50 | | | | | | | | | | |--- lead_time <= 116.00 | | | | | | | | | | | |--- truncated branch of depth 2 | | | | | | | | | | |--- lead_time > 116.00 | | | | | | | | | | | |--- weights: [5.00, 0.00] class: 0 | | | | | | | | |--- date > 1516233612809207808.00 | | | | | | | | | |--- date <= 1535284782064205824.00 | 
| | | | | | | | | |--- date <= 1530791971394682880.00 | | | | | | | | | | | |--- truncated branch of depth 8 | | | | | | | | | | |--- date > 1530791971394682880.00 | | | | | | | | | | | |--- truncated branch of depth 4 | | | | | | | | | |--- date > 1535284782064205824.00 | | | | | | | | | | |--- date <= 1535414386997329920.00 | | | | | | | | | | | |--- truncated branch of depth 3 | | | | | | | | | | |--- date > 1535414386997329920.00 | | | | | | | | | | | |--- truncated branch of depth 4 | | | | | | | |--- no_of_week_nights > 6.50 | | | | | | | | |--- avg_price_per_room <= 130.68 | | | | | | | | | |--- weights: [2.00, 0.00] class: 0 | | | | | | | | |--- avg_price_per_room > 130.68 | | | | | | | | | |--- weights: [0.00, 2.00] class: 1 | | | | | | |--- date > 1536494382293712896.00 | | | | | | | |--- no_of_special_requests <= 2.50 | | | | | | | | |--- room_type_reserved_Room_Type 1 <= 0.50 | | | | | | | | | |--- lead_time <= 91.50 | | | | | | | | | | |--- weights: [0.00, 1.00] class: 1 | | | | | | | | | |--- lead_time > 91.50 | | | | | | | | | | |--- avg_price_per_room <= 105.01 | | | | | | | | | | | |--- weights: [10.00, 0.00] class: 0 | | | | | | | | | | |--- avg_price_per_room > 105.01 | | | | | | | | | | | |--- truncated branch of depth 8 | | | | | | | | |--- room_type_reserved_Room_Type 1 > 0.50 | | | | | | | | | |--- date <= 1545523228183625728.00 | | | | | | | | | | |--- lead_time <= 112.50 | | | | | | | | | | | |--- truncated branch of depth 6 | | | | | | | | | | |--- lead_time > 112.50 | | | | | | | | | | | |--- truncated branch of depth 6 | | | | | | | | | |--- date > 1545523228183625728.00 | | | | | | | | | | |--- weights: [6.00, 0.00] class: 0 | | | | | | | |--- no_of_special_requests > 2.50 | | | | | | | | |--- weights: [34.00, 0.00] class: 0 | | | | | |--- avg_price_per_room > 202.14 | | | | | | |--- weights: [0.00, 6.00] class: 1 |--- lead_time > 150.50 | |--- avg_price_per_room <= 100.04 | | |--- no_of_special_requests <= 0.50 | | | |--- 
market_segment_type_Online <= 0.50 | | | | |--- avg_price_per_room <= 89.76 | | | | | |--- lead_time <= 402.00 | | | | | | |--- lead_time <= 274.00 | | | | | | | |--- avg_price_per_room <= 27.07 | | | | | | | | |--- weights: [0.00, 1.00] class: 1 | | | | | | | |--- avg_price_per_room > 27.07 | | | | | | | | |--- no_of_week_nights <= 9.50 | | | | | | | | | |--- lead_time <= 161.50 | | | | | | | | | | |--- weights: [29.00, 0.00] class: 0 | | | | | | | | | |--- lead_time > 161.50 | | | | | | | | | | |--- lead_time <= 171.50 | | | | | | | | | | | |--- truncated branch of depth 8 | | | | | | | | | | |--- lead_time > 171.50 | | | | | | | | | | | |--- truncated branch of depth 8 | | | | | | | | |--- no_of_week_nights > 9.50 | | | | | | | | | |--- weights: [0.00, 1.00] class: 1 | | | | | | |--- lead_time > 274.00 | | | | | | | |--- lead_time <= 318.50 | | | | | | | | |--- type_of_meal_plan_Meal Plan 2 <= 0.50 | | | | | | | | | |--- lead_time <= 293.50 | | | | | | | | | | |--- lead_time <= 277.50 | | | | | | | | | | | |--- truncated branch of depth 2 | | | | | | | | | | |--- lead_time > 277.50 | | | | | | | | | | | |--- weights: [5.00, 0.00] class: 0 | | | | | | | | | |--- lead_time > 293.50 | | | | | | | | | | |--- date <= 1530403156595310592.00 | | | | | | | | | | | |--- truncated branch of depth 3 | | | | | | | | | | |--- date > 1530403156595310592.00 | | | | | | | | | | | |--- weights: [0.00, 5.00] class: 1 | | | | | | | | |--- type_of_meal_plan_Meal Plan 2 > 0.50 | | | | | | | | | |--- weights: [4.00, 0.00] class: 0 | | | | | | | |--- lead_time > 318.50 | | | | | | | | |--- weights: [8.00, 0.00] class: 0 | | | | | |--- lead_time > 402.00 | | | | | | |--- weights: [0.00, 4.00] class: 1 | | | | |--- avg_price_per_room > 89.76 | | | | | |--- date <= 1528200010171154432.00 | | | | | | |--- date <= 1515844798009835520.00 | | | | | | | |--- weights: [1.00, 1.00] class: 0 | | | | | | |--- date > 1515844798009835520.00 | | | | | | | |--- weights: [0.00, 8.00] class: 1 | | | | 
| |--- date > 1528200010171154432.00 | | | | | | |--- date <= 1537185562790723584.00 | | | | | | | |--- lead_time <= 297.00 | | | | | | | | |--- no_of_week_nights <= 2.50 | | | | | | | | | |--- lead_time <= 243.00 | | | | | | | | | | |--- avg_price_per_room <= 90.77 | | | | | | | | | | | |--- weights: [0.00, 2.00] class: 1 | | | | | | | | | | |--- avg_price_per_room > 90.77 | | | | | | | | | | | |--- truncated branch of depth 3 | | | | | | | | | |--- lead_time > 243.00 | | | | | | | | | | |--- weights: [7.00, 0.00] class: 0 | | | | | | | | |--- no_of_week_nights > 2.50 | | | | | | | | | |--- weights: [8.00, 0.00] class: 0 | | | | | | | |--- lead_time > 297.00 | | | | | | | | |--- lead_time <= 333.00 | | | | | | | | | |--- weights: [0.00, 2.00] class: 1 | | | | | | | | |--- lead_time > 333.00 | | | | | | | | | |--- weights: [1.00, 0.00] class: 0 | | | | | | |--- date > 1537185562790723584.00 | | | | | | | |--- lead_time <= 291.00 | | | | | | | | |--- lead_time <= 224.00 | | | | | | | | | |--- lead_time <= 156.00 | | | | | | | | | | |--- weights: [0.00, 3.00] class: 1 | | | | | | | | | |--- lead_time > 156.00 | | | | | | | | | | |--- avg_price_per_room <= 96.24 | | | | | | | | | | | |--- weights: [3.00, 0.00] class: 0 | | | | | | | | | | |--- avg_price_per_room > 96.24 | | | | | | | | | | | |--- weights: [0.00, 1.00] class: 1 | | | | | | | | |--- lead_time > 224.00 | | | | | | | | | |--- weights: [0.00, 9.00] class: 1 | | | | | | | |--- lead_time > 291.00 | | | | | | | | |--- date <= 1539432002485223424.00 | | | | | | | | | |--- weights: [4.00, 0.00] class: 0 | | | | | | | | |--- date > 1539432002485223424.00 | | | | | | | | | |--- no_of_weekend_nights <= 1.00 | | | | | | | | | | |--- weights: [1.00, 1.00] class: 0 | | | | | | | | | |--- no_of_weekend_nights > 1.00 | | | | | | | | | | |--- weights: [0.00, 1.00] class: 1 | | | |--- market_segment_type_Online > 0.50 | | | | |--- avg_price_per_room <= 2.50 | | | | | |--- lead_time <= 285.50 | | | | | | |--- 
type_of_meal_plan_Meal Plan 1 <= 0.50 | | | | | | | |--- no_of_weekend_nights <= 1.50 | | | | | | | | |--- weights: [1.00, 1.00] class: 0 | | | | | | | |--- no_of_weekend_nights > 1.50 | | | | | | | | |--- weights: [1.00, 0.00] class: 0 | | | | | | |--- type_of_meal_plan_Meal Plan 1 > 0.50 | | | | | | | |--- weights: [8.00, 0.00] class: 0 | | | | | |--- lead_time > 285.50 | | | | | | |--- date <= 1506556810973151232.00 | | | | | | | |--- weights: [1.00, 0.00] class: 0 | | | | | | |--- date > 1506556810973151232.00 | | | | | | | |--- weights: [0.00, 3.00] class: 1 | | | | |--- avg_price_per_room > 2.50 | | | | | |--- no_of_adults <= 2.50 | | | | | | |--- date <= 1543751983670755328.00 | | | | | | | |--- weights: [0.00, 340.00] class: 1 | | | | | | |--- date > 1543751983670755328.00 | | | | | | | |--- date <= 1543967968986136576.00 | | | | | | | | |--- weights: [2.00, 0.00] class: 0 | | | | | | | |--- date > 1543967968986136576.00 | | | | | | | | |--- lead_time <= 215.50 | | | | | | | | | |--- lead_time <= 212.50 | | | | | | | | | | |--- weights: [0.00, 6.00] class: 1 | | | | | | | | | |--- lead_time > 212.50 | | | | | | | | | | |--- weights: [3.00, 0.00] class: 0 | | | | | | | | |--- lead_time > 215.50 | | | | | | | | | |--- weights: [0.00, 43.00] class: 1 | | | | | |--- no_of_adults > 2.50 | | | | | | |--- weights: [1.00, 0.00] class: 0 | | |--- no_of_special_requests > 0.50 | | | |--- no_of_weekend_nights <= 0.50 | | | | |--- lead_time <= 180.50 | | | | | |--- date <= 1541678442179723264.00 | | | | | | |--- avg_price_per_room <= 99.83 | | | | | | | |--- lead_time <= 176.00 | | | | | | | | |--- weights: [39.00, 0.00] class: 0 | | | | | | | |--- lead_time > 176.00 | | | | | | | | |--- date <= 1533211240573173760.00 | | | | | | | | | |--- no_of_special_requests <= 1.50 | | | | | | | | | | |--- avg_price_per_room <= 97.80 | | | | | | | | | | | |--- weights: [0.00, 2.00] class: 1 | | | | | | | | | | |--- avg_price_per_room > 97.80 | | | | | | | | | | | |--- weights: 
[1.00, 0.00] class: 0 | | | | | | | | | |--- no_of_special_requests > 1.50 | | | | | | | | | | |--- weights: [2.00, 0.00] class: 0 | | | | | | | | |--- date > 1533211240573173760.00 | | | | | | | | | |--- weights: [8.00, 0.00] class: 0 | | | | | | |--- avg_price_per_room > 99.83 | | | | | | | |--- type_of_meal_plan_Meal Plan 1 <= 0.50 | | | | | | | | |--- weights: [0.00, 1.00] class: 1 | | | | | | | |--- type_of_meal_plan_Meal Plan 1 > 0.50 | | | | | | | | |--- weights: [1.00, 0.00] class: 0 | | | | | |--- date > 1541678442179723264.00 | | | | | | |--- avg_price_per_room <= 82.10 | | | | | | | |--- weights: [2.00, 0.00] class: 0 | | | | | | |--- avg_price_per_room > 82.10 | | | | | | | |--- weights: [0.00, 2.00] class: 1 | | | | |--- lead_time > 180.50 | | | | | |--- no_of_special_requests <= 2.50 | | | | | | |--- market_segment_type_Online <= 0.50 | | | | | | | |--- no_of_adults <= 2.50 | | | | | | | | |--- lead_time <= 356.00 | | | | | | | | | |--- weights: [12.00, 0.00] class: 0 | | | | | | | | |--- lead_time > 356.00 | | | | | | | | | |--- weights: [0.00, 1.00] class: 1 | | | | | | | |--- no_of_adults > 2.50 | | | | | | | | |--- weights: [0.00, 2.00] class: 1 | | | | | | |--- market_segment_type_Online > 0.50 | | | | | | | |--- date <= 1544097573919260672.00 | | | | | | | | |--- no_of_week_nights <= 0.50 | | | | | | | | | |--- weights: [2.00, 0.00] class: 0 | | | | | | | | |--- no_of_week_nights > 0.50 | | | | | | | | | |--- weights: [0.00, 89.00] class: 1 | | | | | | | |--- date > 1544097573919260672.00 | | | | | | | | |--- lead_time <= 301.50 | | | | | | | | | |--- type_of_meal_plan_Meal Plan 1 <= 0.50 | | | | | | | | | | |--- avg_price_per_room <= 75.87 | | | | | | | | | | | |--- weights: [2.00, 0.00] class: 0 | | | | | | | | | | |--- avg_price_per_room > 75.87 | | | | | | | | | | | |--- weights: [0.00, 3.00] class: 1 | | | | | | | | | |--- type_of_meal_plan_Meal Plan 1 > 0.50 | | | | | | | | | | |--- no_of_children <= 0.50 | | | | | | | | | | | |--- 
weights: [8.00, 0.00] class: 0 | | | | | | | | | | |--- no_of_children > 0.50 | | | | | | | | | | | |--- weights: [0.00, 1.00] class: 1 | | | | | | | | |--- lead_time > 301.50 | | | | | | | | | |--- weights: [0.00, 2.00] class: 1 | | | | | |--- no_of_special_requests > 2.50 | | | | | | |--- weights: [9.00, 0.00] class: 0 | | | |--- no_of_weekend_nights > 0.50 | | | | |--- date <= 1544788823135748096.00 | | | | | |--- no_of_week_nights <= 9.50 | | | | | | |--- avg_price_per_room <= 94.21 | | | | | | | |--- date <= 1540641602714730496.00 | | | | | | | | |--- lead_time <= 372.50 | | | | | | | | | |--- no_of_week_nights <= 5.50 | | | | | | | | | | |--- lead_time <= 165.50 | | | | | | | | | | | |--- truncated branch of depth 11 | | | | | | | | | | |--- lead_time > 165.50 | | | | | | | | | | | |--- truncated branch of depth 10 | | | | | | | | | |--- no_of_week_nights > 5.50 | | | | | | | | | | |--- avg_price_per_room <= 81.00 | | | | | | | | | | | |--- weights: [11.00, 0.00] class: 0 | | | | | | | | | | |--- avg_price_per_room > 81.00 | | | | | | | | | | | |--- truncated branch of depth 4 | | | | | | | | |--- lead_time > 372.50 | | | | | | | | | |--- market_segment_type_Offline <= 0.50 | | | | | | | | | | |--- weights: [0.00, 1.00] class: 1 | | | | | | | | | |--- market_segment_type_Offline > 0.50 | | | | | | | | | | |--- weights: [1.00, 1.00] class: 0 | | | | | | | |--- date > 1540641602714730496.00 | | | | | | | | |--- avg_price_per_room <= 63.15 | | | | | | | | | |--- weights: [9.00, 0.00] class: 0 | | | | | | | | |--- avg_price_per_room > 63.15 | | | | | | | | | |--- date <= 1540727983096987648.00 | | | | | | | | | | |--- weights: [0.00, 2.00] class: 1 | | | | | | | | | |--- date > 1540727983096987648.00 | | | | | | | | | | |--- no_of_special_requests <= 1.50 | | | | | | | | | | | |--- truncated branch of depth 3 | | | | | | | | | | |--- no_of_special_requests > 1.50 | | | | | | | | | | | |--- truncated branch of depth 6 | | | | | | |--- avg_price_per_room > 94.21 | 
| | | | | | |--- type_of_meal_plan_Meal Plan 1 <= 0.50 | | | | | | | | |--- no_of_week_nights <= 2.50 | | | | | | | | | |--- date <= 1531439996060303360.00 | | | | | | | | | | |--- weights: [2.00, 0.00] class: 0 | | | | | | | | | |--- date > 1531439996060303360.00 | | | | | | | | | | |--- date <= 1533816040687927296.00 | | | | | | | | | | | |--- weights: [0.00, 2.00] class: 1 | | | | | | | | | | |--- date > 1533816040687927296.00 | | | | | | | | | | | |--- truncated branch of depth 5 | | | | | | | | |--- no_of_week_nights > 2.50 | | | | | | | | | |--- weights: [0.00, 4.00] class: 1 | | | | | | | |--- type_of_meal_plan_Meal Plan 1 > 0.50 | | | | | | | | |--- date <= 1540166407533101056.00 | | | | | | | | | |--- no_of_week_nights <= 2.50 | | | | | | | | | | |--- date <= 1538092797322592256.00 | | | | | | | | | | | |--- truncated branch of depth 8 | | | | | | | | | | |--- date > 1538092797322592256.00 | | | | | | | | | | | |--- truncated branch of depth 2 | | | | | | | | | |--- no_of_week_nights > 2.50 | | | | | | | | | | |--- date <= 1534723206500319232.00 | | | | | | | | | | | |--- truncated branch of depth 6 | | | | | | | | | | |--- date > 1534723206500319232.00 | | | | | | | | | | | |--- weights: [8.00, 0.00] class: 0 | | | | | | | | |--- date > 1540166407533101056.00 | | | | | | | | | |--- no_of_special_requests <= 2.50 | | | | | | | | | | |--- weights: [0.00, 4.00] class: 1 | | | | | | | | | |--- no_of_special_requests > 2.50 | | | | | | | | | | |--- weights: [2.00, 0.00] class: 0 | | | | | |--- no_of_week_nights > 9.50 | | | | | | |--- weights: [0.00, 4.00] class: 1 | | | | |--- date > 1544788823135748096.00 | | | | | |--- date <= 1546084803747512320.00 | | | | | | |--- lead_time <= 281.50 | | | | | | | |--- date <= 1545912042982998016.00 | | | | | | | | |--- lead_time <= 252.50 | | | | | | | | | |--- avg_price_per_room <= 62.81 | | | | | | | | | | |--- lead_time <= 245.50 | | | | | | | | | | | |--- weights: [0.00, 1.00] class: 1 | | | | | | | | | | |--- 
lead_time > 245.50 | | | | | | | | | | | |--- weights: [1.00, 1.00] class: 0 | | | | | | | | | |--- avg_price_per_room > 62.81 | | | | | | | | | | |--- no_of_week_nights <= 1.50 | | | | | | | | | | | |--- weights: [0.00, 1.00] class: 1 | | | | | | | | | | |--- no_of_week_nights > 1.50 | | | | | | | | | | | |--- weights: [5.00, 0.00] class: 0 | | | | | | | | |--- lead_time > 252.50 | | | | | | | | | |--- no_of_week_nights <= 1.00 | | | | | | | | | | |--- weights: [1.00, 0.00] class: 0 | | | | | | | | | |--- no_of_week_nights > 1.00 | | | | | | | | | | |--- weights: [0.00, 4.00] class: 1 | | | | | | | |--- date > 1545912042982998016.00 | | | | | | | | |--- no_of_children <= 1.50 | | | | | | | | | |--- weights: [9.00, 0.00] class: 0 | | | | | | | | |--- no_of_children > 1.50 | | | | | | | | | |--- weights: [0.00, 1.00] class: 1 | | | | | | |--- lead_time > 281.50 | | | | | | | |--- market_segment_type_Offline <= 0.50 | | | | | | | | |--- weights: [0.00, 5.00] class: 1 | | | | | | | |--- market_segment_type_Offline > 0.50 | | | | | | | | |--- weights: [1.00, 0.00] class: 0 | | | | | |--- date > 1546084803747512320.00 | | | | | | |--- weights: [0.00, 5.00] class: 1 | |--- avg_price_per_room > 100.04 | | |--- no_of_special_requests <= 2.50 | | | |--- date <= 1543492773804507136.00 | | | | |--- date <= 1518220773917982720.00 | | | | | |--- date <= 1511395211891179520.00 | | | | | | |--- weights: [0.00, 13.00] class: 1 | | | | | |--- date > 1511395211891179520.00 | | | | | | |--- weights: [2.00, 0.00] class: 0 | | | | |--- date > 1518220773917982720.00 | | | | | |--- weights: [0.00, 962.00] class: 1 | | | |--- date > 1543492773804507136.00 | | | | |--- no_of_special_requests <= 0.50 | | | | | |--- weights: [35.00, 0.00] class: 0 | | | | |--- no_of_special_requests > 0.50 | | | | | |--- room_type_reserved_Room_Type 1 <= 0.50 | | | | | | |--- no_of_special_requests <= 1.50 | | | | | | | |--- weights: [0.00, 13.00] class: 1 | | | | | | |--- no_of_special_requests > 1.50 | | | 
| | | | |--- avg_price_per_room <= 110.74 | | | | | | | | |--- no_of_week_nights <= 4.00 | | | | | | | | | |--- weights: [3.00, 0.00] class: 0 | | | | | | | | |--- no_of_week_nights > 4.00 | | | | | | | | | |--- weights: [0.00, 1.00] class: 1 | | | | | | | |--- avg_price_per_room > 110.74 | | | | | | | | |--- lead_time <= 162.00 | | | | | | | | | |--- weights: [1.00, 0.00] class: 0 | | | | | | | | |--- lead_time > 162.00 | | | | | | | | | |--- weights: [0.00, 5.00] class: 1 | | | | | |--- room_type_reserved_Room_Type 1 > 0.50 | | | | | | |--- lead_time <= 151.50 | | | | | | | |--- weights: [0.00, 3.00] class: 1 | | | | | | |--- lead_time > 151.50 | | | | | | | |--- weights: [6.00, 0.00] class: 0 | | |--- no_of_special_requests > 2.50 | | | |--- weights: [32.00, 0.00] class: 0
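The `date` thresholds in the printed tree are large integers such as `1543881588603879424.00`. They look like timestamps stored as nanoseconds since the Unix epoch, which is what a pandas datetime column becomes when cast to an integer; that encoding is an assumption here, not something the output states. Under that assumption, a minimal sketch for turning a split value back into a calendar date (the helper name `ns_to_utc` is hypothetical):

```python
from datetime import datetime, timezone


def ns_to_utc(ns_since_epoch):
    """Convert integer nanoseconds since the Unix epoch to a UTC datetime (second precision)."""
    seconds = int(ns_since_epoch) // 10**9
    return datetime.fromtimestamp(seconds, tz=timezone.utc)


# Two thresholds copied from the printed tree above
first_split = ns_to_utc(1543881588603879424)
second_split = ns_to_utc(1517399988487847936)
print(first_split.date(), second_split.date())
```

Read this way, a rule such as `date <= 1543881588603879424.00` separates bookings up to a given day from those after it, which is easier to interpret than the raw integer.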
importances = model.feature_importances_
indices = np.argsort(importances)
plt.figure(figsize=(12, 12))
plt.title("Feature Importances")
plt.barh(range(len(indices)), importances[indices], color="violet", align="center")
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel("Relative Importance")
plt.show()
## Function to create confusion matrix
def make_confusion_matrix(model, y_actual, labels=(0, 1)):
    """
    model : fitted classifier used to predict on X_test (taken from the notebook scope)
    y_actual : ground truth for X_test
    labels : class order for the rows and columns (0 = No, 1 = Yes)
    """
    y_predict = model.predict(X_test)
    cm = metrics.confusion_matrix(y_actual, y_predict, labels=list(labels))
    df_cm = pd.DataFrame(
        cm,
        index=["Actual - No", "Actual - Yes"],
        columns=["Predicted - No", "Predicted - Yes"],
    )
    # Annotate each cell with its count and its share of all predictions
    group_counts = ["{0:0.0f}".format(value) for value in cm.flatten()]
    group_percentages = ["{0:.2%}".format(value) for value in cm.flatten() / np.sum(cm)]
    annot = [f"{v1}\n{v2}" for v1, v2 in zip(group_counts, group_percentages)]
    annot = np.asarray(annot).reshape(2, 2)
    plt.figure(figsize=(10, 7))
    sns.heatmap(df_cm, annot=annot, fmt="")
    plt.ylabel("True label")
    plt.xlabel("Predicted label")
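The annotation step inside `make_confusion_matrix` can be checked in isolation. The sketch below is plain Python and uses a made-up 2×2 matrix (the counts are hypothetical, not results from this dataset); `cell_labels` is an illustrative helper name.

```python
def cell_labels(cm):
    """Return count/percentage strings with the same shape as the matrix."""
    total = sum(sum(row) for row in cm)
    return [[f"{value:0.0f}\n{value / total:.2%}" for value in row] for row in cm]


# Hypothetical counts: rows are actual No/Yes, columns are predicted No/Yes.
example_cm = [[50, 10], [15, 25]]
labels = cell_labels(example_cm)
print(labels[1][1])
```

Each cell shows the raw count on the first line and its share of all predictions on the second, which is the text the heatmap receives through `annot`.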
## Function to calculate recall score
def get_recall_score(model):
    """
    model : classifier to predict values of X
    """
    pred_train = model.predict(X_train)
    pred_test = model.predict(X_test)
    print("Recall on training set : ", metrics.recall_score(y_train, pred_train))
    print("Recall on test set : ", metrics.recall_score(y_test, pred_test))
from sklearn.metrics import make_scorer, recall_score
from sklearn.model_selection import GridSearchCV

# Choose the type of classifier.
estimator = DecisionTreeClassifier(random_state=1)
# Grid of parameters to choose from
parameters = {
    # list each depth as its own candidate; nesting the array makes it a single invalid value
    "max_depth": list(np.arange(2, 50, 5)) + [None],
    "criterion": ["entropy", "gini"],
    "splitter": ["best", "random"],
    "min_impurity_decrease": [0.000001, 0.00001, 0.0001],
}
# Type of scoring used to compare parameter combinations
recall_scorer = make_scorer(recall_score)
# Run the grid search
grid_obj = GridSearchCV(estimator, parameters, scoring=recall_scorer, cv=5)
grid_obj = grid_obj.fit(X_train, y_train)
# Set the clf to the best combination of parameters
estimator = grid_obj.best_estimator_
# Fit the best algorithm to the data.
estimator.fit(X_train, y_train)
DecisionTreeClassifier(min_impurity_decrease=1e-05, random_state=1)
decision_tree_tune_perf_train = model_performance_classification_sklearn(
estimator, X_train, y_train
)
decision_tree_tune_perf_train
| | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 0.993783 | 0.978683 | 0.999806 | 0.989131 |
confusion_matrix_sklearn(estimator, X_train, y_train)
decision_tree_tune_perf_test = model_performance_classification_sklearn(
estimator, X_test, y_test
)
decision_tree_tune_perf_test
| | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 0.821566 | 0.681797 | 0.681173 | 0.681485 |
confusion_matrix_sklearn(estimator, X_test, y_test)
dTree1 = DecisionTreeClassifier(criterion = 'gini',max_depth=3,random_state=1)
dTree1.fit(X_train, y_train)
DecisionTreeClassifier(max_depth=3, random_state=1)
make_confusion_matrix(dTree1, y_test)
# Accuracy on train and test
print("Accuracy on training set : ",dTree1.score(X_train, y_train))
print("Accuracy on test set : ",dTree1.score(X_test, y_test))
# Recall on train and test
get_recall_score(dTree1)
Accuracy on training set : 0.78585969738652
Accuracy on test set : 0.7880616174582799
Recall on training set : 0.724971450323563
Recall on test set : 0.7235213204951857
plt.figure(figsize=(15,10))
tree.plot_tree(dTree1,feature_names=feature_names,filled=True,fontsize=9,node_ids=True,class_names=True)
plt.show()
# Text report showing the rules of a decision tree -
print(tree.export_text(dTree1,feature_names=feature_names,show_weights=True))
|--- lead_time <= 150.50
|   |--- no_of_special_requests <= 0.50
|   |   |--- market_segment_type_Online <= 0.50
|   |   |   |--- weights: [2504.00, 245.00] class: 0
|   |   |--- market_segment_type_Online > 0.50
|   |   |   |--- weights: [2202.00, 2313.00] class: 1
|   |--- no_of_special_requests > 0.50
|   |   |--- date <= 1501070385389502464.00
|   |   |   |--- weights: [5.00, 43.00] class: 1
|   |   |--- date > 1501070385389502464.00
|   |   |   |--- weights: [7512.00, 1003.00] class: 0
|--- lead_time > 150.50
|   |--- avg_price_per_room <= 100.04
|   |   |--- no_of_special_requests <= 0.50
|   |   |   |--- weights: [193.00, 456.00] class: 1
|   |   |--- no_of_special_requests > 0.50
|   |   |   |--- weights: [426.00, 197.00] class: 0
|   |--- avg_price_per_room > 100.04
|   |   |--- no_of_special_requests <= 2.50
|   |   |   |--- weights: [47.00, 997.00] class: 1
|   |   |--- no_of_special_requests > 2.50
|   |   |   |--- weights: [32.00, 0.00] class: 0
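Each leaf in the report lists how many training rows of class 0 and class 1 reach it, and the predicted class is the majority. As a rough cross-check, the sketch below rebuilds the training confusion counts from those leaf weights and recomputes accuracy and recall in plain Python. The list encoding and the helper name `confusion_from_leaves` are assumptions made for this illustration.

```python
# Leaf weights (class 0, class 1) and predicted class, copied from the report above.
LEAVES = [
    ((2504, 245), 0),
    ((2202, 2313), 1),
    ((5, 43), 1),
    ((7512, 1003), 0),
    ((193, 456), 1),
    ((426, 197), 0),
    ((47, 997), 1),
    ((32, 0), 0),
]


def confusion_from_leaves(leaves):
    """Return (tn, fp, fn, tp) for the training rows summarised by the leaves."""
    tn = fp = fn = tp = 0
    for (n_neg, n_pos), predicted in leaves:
        if predicted == 1:
            fp += n_neg  # non-cancelled bookings predicted as cancelled
            tp += n_pos
        else:
            tn += n_neg
            fn += n_pos  # cancelled bookings the leaf misses
    return tn, fp, fn, tp


tn, fp, fn, tp = confusion_from_leaves(LEAVES)
train_accuracy = (tn + tp) / (tn + fp + fn + tp)
train_recall = tp / (tp + fn)
print(f"accuracy={train_accuracy:.4f} recall={train_recall:.4f}")
```

The two figures agree with the training accuracy and training recall printed for `dTree1` above, which confirms that the leaf weights are training-set counts.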
# importance of features in the tree building ( The importance of a feature is computed as the
#(normalized) total reduction of the criterion brought by that feature. It is also known as the Gini importance )
print (pd.DataFrame(dTree1.feature_importances_, columns = ["Imp"], index = X_train.columns).sort_values(by = 'Imp', ascending = False))
                                           Imp
lead_time                             0.391170
market_segment_type_Online            0.259427
no_of_special_requests                0.240515
avg_price_per_room                    0.084389
date                                  0.024498
room_type_reserved_Room_Type 1        0.000000
market_segment_type_Offline           0.000000
market_segment_type_Corporate         0.000000
market_segment_type_Complementary     0.000000
market_segment_type_Aviation          0.000000
room_type_reserved_Room_Type 7        0.000000
room_type_reserved_Room_Type 6        0.000000
room_type_reserved_Room_Type 5        0.000000
room_type_reserved_Room_Type 4        0.000000
room_type_reserved_Room_Type 3        0.000000
room_type_reserved_Room_Type 2        0.000000
no_of_adults                          0.000000
type_of_meal_plan_Not Selected        0.000000
no_of_children                        0.000000
type_of_meal_plan_Meal Plan 2         0.000000
type_of_meal_plan_Meal Plan 1         0.000000
no_of_previous_bookings_not_canceled  0.000000
no_of_previous_cancellations          0.000000
repeated_guest                        0.000000
required_car_parking_space            0.000000
no_of_week_nights                     0.000000
no_of_weekend_nights                  0.000000
type_of_meal_plan_Meal Plan 3         0.000000
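The comment above describes Gini importance as the normalized total reduction of the criterion. As a rough cross-check, the sketch below recomputes it in plain Python from the leaf weights of the depth-3 tree report. The tuple encoding of the tree and the helper names are assumptions made for this illustration, not part of scikit-learn.

```python
from collections import defaultdict

# Internal node: (feature, left_subtree, right_subtree); leaf: (n_class0, n_class1).
TREE = (
    "lead_time",
    ("no_of_special_requests",
     ("market_segment_type_Online", (2504, 245), (2202, 2313)),
     ("date", (5, 43), (7512, 1003))),
    ("avg_price_per_room",
     ("no_of_special_requests", (193, 456), (426, 197)),
     ("no_of_special_requests", (47, 997), (32, 0))),
)


def is_leaf(node):
    return len(node) == 2


def class_counts(node):
    """Class counts of all training rows that reach this node."""
    if is_leaf(node):
        return node
    _, left, right = node
    l0, l1 = class_counts(left)
    r0, r1 = class_counts(right)
    return (l0 + r0, l1 + r1)


def weighted_gini(counts):
    """Number of rows times the Gini impurity, i.e. n * 2 * p0 * p1."""
    n0, n1 = counts
    n = n0 + n1
    return 2 * n0 * n1 / n if n else 0.0


def accumulate(node, totals):
    """Add each split's impurity reduction to the feature it splits on."""
    if is_leaf(node):
        return
    feature, left, right = node
    totals[feature] += (weighted_gini(class_counts(node))
                        - weighted_gini(class_counts(left))
                        - weighted_gini(class_counts(right)))
    accumulate(left, totals)
    accumulate(right, totals)


def gini_importances(tree):
    totals = defaultdict(float)
    accumulate(tree, totals)
    grand_total = sum(totals.values())
    return {feature: value / grand_total for feature, value in totals.items()}


importances_by_hand = gini_importances(TREE)
for name, value in sorted(importances_by_hand.items(), key=lambda kv: -kv[1]):
    print(f"{name:30s} {value:.6f}")
```

Features that never appear in a split receive no reduction at all, which is why every other column in the printed table has an importance of exactly zero.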
importances = dTree1.feature_importances_
indices = np.argsort(importances)
plt.figure(figsize=(10,10))
plt.title('Feature Importances')
plt.barh(range(len(indices)), importances[indices], color='violet', align='center')
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel('Relative Importance')
plt.show()
Lead time is again the most important feature, as it was in the previous model, but most of the other variables now have an importance of zero. This is the shortcoming of pre-pruning: capping the depth stops the tree before it can learn whether those features would produce useful splits.
This is a good first-pass fit, but we will continue with GridSearchCV to give a full comparison of the models.
from sklearn.model_selection import GridSearchCV
# Choose the type of classifier.
estimator = DecisionTreeClassifier(random_state=1)
# Grid of parameters to choose from
parameters = {
    "max_depth": np.arange(1, 10),
    "min_samples_leaf": [1, 2, 5, 7, 10, 15, 20],
    "max_leaf_nodes": [2, 3, 5, 10],
    "min_impurity_decrease": [0.001, 0.01, 0.1],
}
# Type of scoring used to compare parameter combinations
acc_scorer = metrics.make_scorer(metrics.recall_score)
# Run the grid search
grid_obj = GridSearchCV(estimator, parameters, scoring=acc_scorer,cv=5)
grid_obj = grid_obj.fit(X_train, y_train)
# Set the clf to the best combination of parameters
estimator = grid_obj.best_estimator_
# Fit the best algorithm to the data.
estimator.fit(X_train, y_train)
DecisionTreeClassifier(max_depth=3, max_leaf_nodes=5,
min_impurity_decrease=0.001, random_state=1)
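To get a feel for how much work this grid search does, the sketch below enumerates the same parameter grid with `itertools.product` and counts the candidates. It is only an illustration of the search space, not a replacement for `GridSearchCV`; the variable names are assumptions.

```python
from itertools import product

# Same values as the grid above, written as plain Python lists.
param_grid = {
    "max_depth": list(range(1, 10)),
    "min_samples_leaf": [1, 2, 5, 7, 10, 15, 20],
    "max_leaf_nodes": [2, 3, 5, 10],
    "min_impurity_decrease": [0.001, 0.01, 0.1],
}
cv_folds = 5

# One dictionary per parameter combination, in the order product() yields them.
candidates = [dict(zip(param_grid, values)) for values in product(*param_grid.values())]
n_fits = len(candidates) * cv_folds
print(len(candidates), "candidates,", n_fits, "cross-validation fits")
```

With the default `refit=True`, `GridSearchCV` also refits the winning combination once on the full training set, so `best_estimator_` is already fitted when it is retrieved.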
make_confusion_matrix(estimator,y_test)
plt.figure(figsize=(15,10))
tree.plot_tree(estimator,feature_names=feature_names,filled=True,fontsize=9,node_ids=True,class_names=True)
plt.show()
# Text report showing the rules of a decision tree -
print(tree.export_text(estimator,feature_names=feature_names,show_weights=True))
|--- lead_time <= 150.50
|   |--- no_of_special_requests <= 0.50
|   |   |--- market_segment_type_Online <= 0.50
|   |   |   |--- weights: [2504.00, 245.00] class: 0
|   |   |--- market_segment_type_Online > 0.50
|   |   |   |--- weights: [2202.00, 2313.00] class: 1
|   |--- no_of_special_requests > 0.50
|   |   |--- weights: [7517.00, 1046.00] class: 0
|--- lead_time > 150.50
|   |--- avg_price_per_room <= 100.04
|   |   |--- weights: [619.00, 653.00] class: 1
|   |--- avg_price_per_room > 100.04
|   |   |--- weights: [79.00, 997.00] class: 1
# importance of features in the tree building ( The importance of a feature is computed as the
#(normalized) total reduction of the 'criterion' brought by that feature. It is also known as the Gini importance )
print (pd.DataFrame(estimator.feature_importances_, columns = ["Imp"], index = X_train.columns).sort_values(by = 'Imp', ascending = False))
#Here we will see that importance of features has increased
                                           Imp
lead_time                             0.429267
market_segment_type_Online            0.284693
no_of_special_requests                0.193433
avg_price_per_room                    0.092608
room_type_reserved_Room_Type 1        0.000000
market_segment_type_Offline           0.000000
market_segment_type_Corporate         0.000000
market_segment_type_Complementary     0.000000
market_segment_type_Aviation          0.000000
room_type_reserved_Room_Type 7        0.000000
room_type_reserved_Room_Type 6        0.000000
room_type_reserved_Room_Type 5        0.000000
room_type_reserved_Room_Type 4        0.000000
room_type_reserved_Room_Type 3        0.000000
room_type_reserved_Room_Type 2        0.000000
no_of_adults                          0.000000
type_of_meal_plan_Not Selected        0.000000
no_of_children                        0.000000
type_of_meal_plan_Meal Plan 2         0.000000
type_of_meal_plan_Meal Plan 1         0.000000
date                                  0.000000
no_of_previous_bookings_not_canceled  0.000000
no_of_previous_cancellations          0.000000
repeated_guest                        0.000000
required_car_parking_space            0.000000
no_of_week_nights                     0.000000
no_of_weekend_nights                  0.000000
type_of_meal_plan_Meal Plan 3         0.000000
importances = estimator.feature_importances_
indices = np.argsort(importances)
plt.figure(figsize=(12,12))
plt.title('Feature Importances')
plt.barh(range(len(indices)), importances[indices], color='violet', align='center')
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel('Relative Importance')
plt.show()
clf = DecisionTreeClassifier(random_state=1)
path = clf.cost_complexity_pruning_path(X_train, y_train)
ccp_alphas, impurities = path.ccp_alphas, path.impurities
pd.DataFrame(path)
| | ccp_alphas | impurities |
|---|---|---|
| 0 | 0.000000 | 0.003631 |
| 1 | 0.000000 | 0.003631 |
| 2 | 0.000000 | 0.003631 |
| 3 | 0.000000 | 0.003631 |
| 4 | 0.000000 | 0.003631 |
| ... | ... | ... |
| 1108 | 0.006623 | 0.277423 |
| 1109 | 0.010952 | 0.288375 |
| 1110 | 0.015335 | 0.303710 |
| 1111 | 0.028273 | 0.360256 |
| 1112 | 0.050768 | 0.411024 |
1113 rows × 2 columns
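The last two rows of the path can be reproduced by hand, which helps show what an "effective alpha" is. In minimal cost-complexity pruning, the effective alpha of a node is the impurity gained by collapsing its subtree, divided by the number of leaves removed. The sketch below applies that to the final step, where a two-leaf tree split on `lead_time` collapses to the root. The class counts are summed from the leaf weights in the tree reports in this section, and the helper names are assumptions.

```python
N_TOTAL = 18175  # training rows = sum of all leaf weights in the tree reports


def gini(n_neg, n_pos):
    """Gini impurity of a two-class node."""
    p = n_pos / (n_neg + n_pos)
    return 2 * p * (1 - p)


def node_risk(n_neg, n_pos, n_total=N_TOTAL):
    """Impurity of a node weighted by the share of training rows that reach it."""
    return (n_neg + n_pos) / n_total * gini(n_neg, n_pos)


root = (12921, 5254)
left = (12223, 3604)   # lead_time <= 150.50
right = (698, 1650)    # lead_time > 150.50

root_impurity = node_risk(*root)
stump_impurity = node_risk(*left) + node_risk(*right)
n_leaves = 2
effective_alpha = (root_impurity - stump_impurity) / (n_leaves - 1)
print(f"{stump_impurity:.6f} {root_impurity:.6f} {effective_alpha:.6f}")
```

The three numbers line up with the `impurities` entries of rows 1111 and 1112 and with the final `ccp_alphas` entry, the alpha at which only the root node is left.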
fig, ax = plt.subplots(figsize=(10,5))
ax.plot(ccp_alphas[:-1], impurities[:-1], marker='o', drawstyle="steps-post")
ax.set_xlabel("effective alpha")
ax.set_ylabel("total impurity of leaves")
ax.set_title("Total Impurity vs effective alpha for training set")
plt.show()
clfs = []
for ccp_alpha in ccp_alphas:
    clf = DecisionTreeClassifier(random_state=1, ccp_alpha=ccp_alpha)
    clf.fit(X_train, y_train)
    clfs.append(clf)
print("Number of nodes in the last tree is: {} with ccp_alpha: {}".format(
    clfs[-1].tree_.node_count, ccp_alphas[-1]))
Number of nodes in the last tree is: 1 with ccp_alpha: 0.050767983648866255
clfs = clfs[:-1]
ccp_alphas = ccp_alphas[:-1]
node_counts = [clf.tree_.node_count for clf in clfs]
depth = [clf.tree_.max_depth for clf in clfs]
fig, ax = plt.subplots(2, 1,figsize=(10,7))
ax[0].plot(ccp_alphas, node_counts, marker='o', drawstyle="steps-post")
ax[0].set_xlabel("alpha")
ax[0].set_ylabel("number of nodes")
ax[0].set_title("Number of nodes vs alpha")
ax[1].plot(ccp_alphas, depth, marker='o', drawstyle="steps-post")
ax[1].set_xlabel("alpha")
ax[1].set_ylabel("depth of tree")
ax[1].set_title("Depth vs alpha")
fig.tight_layout()
train_scores = [clf.score(X_train, y_train) for clf in clfs]
test_scores = [clf.score(X_test, y_test) for clf in clfs]
fig, ax = plt.subplots(figsize=(10,5))
ax.set_xlabel("alpha")
ax.set_ylabel("accuracy")
ax.set_title("Accuracy vs alpha for training and testing sets")
ax.plot(ccp_alphas, train_scores, marker='o', label="train",
drawstyle="steps-post")
ax.plot(ccp_alphas, test_scores, marker='o', label="test",
drawstyle="steps-post")
ax.legend()
plt.show()
index_best_model = np.argmax(test_scores)
best_model = clfs[index_best_model]
print(best_model)
print('Training accuracy of best model: ',best_model.score(X_train, y_train))
print('Test accuracy of best model: ',best_model.score(X_test, y_test))
DecisionTreeClassifier(ccp_alpha=0.0002428734316759681, random_state=1)
Training accuracy of best model: 0.8652544704264099
Test accuracy of best model: 0.8581514762516046
recall_train = []
for clf in clfs:
    pred_train3 = clf.predict(X_train)
    values_train = metrics.recall_score(y_train, pred_train3)
    recall_train.append(values_train)
recall_test = []
for clf in clfs:
    pred_test3 = clf.predict(X_test)
    values_test = metrics.recall_score(y_test, pred_test3)
    recall_test.append(values_test)
fig, ax = plt.subplots(figsize=(15,5))
ax.set_xlabel("alpha")
ax.set_ylabel("Recall")
ax.set_title("Recall vs alpha for training and testing sets")
ax.plot(ccp_alphas, recall_train, marker='o', label="train",
drawstyle="steps-post")
ax.plot(ccp_alphas, recall_test, marker='o', label="test",
drawstyle="steps-post")
ax.legend()
plt.show()
# selecting the model with the highest test recall
index_best_model = np.argmax(recall_test)
best_model = clfs[index_best_model]
print(best_model)
DecisionTreeClassifier(ccp_alpha=0.015334740776881739, random_state=1)
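One detail of this selection step is how ties are broken. `np.argmax` returns the first maximum, and because `ccp_alphas` is sorted in increasing order, that is the smallest alpha (the largest tree) among equally good candidates. The sketch below mirrors that behaviour in plain Python and shows a possible alternative that prefers the simplest tied tree. The recall values are hypothetical and the helper names are assumptions.

```python
def first_argmax(values):
    """Index of the first maximum, matching np.argmax's tie-breaking."""
    return max(range(len(values)), key=lambda i: values[i])


def last_argmax(values):
    """Index of the last maximum, i.e. the largest alpha among tied scores."""
    return len(values) - 1 - first_argmax(values[::-1])


# Hypothetical test-recall values for alphas sorted in increasing order.
alphas = [0.0, 0.001, 0.01, 0.02, 0.05]
recalls = [0.68, 0.70, 0.75, 0.75, 0.60]

print("first maximum:", alphas[first_argmax(recalls)])
print("last maximum:", alphas[last_argmax(recalls)])
```

Whether to prefer the larger or the smaller tree among ties is a judgment call; the smaller tree is easier to explain, which may matter when the rules are turned into booking policies.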
make_confusion_matrix(best_model,y_test)
# Recall on train and test
get_recall_score(best_model)
Recall on training set : 0.7542824514655501
Recall on test set : 0.7524071526822559
plt.figure(figsize=(17,15))
tree.plot_tree(best_model,feature_names=feature_names,filled=True,fontsize=9,node_ids=True,class_names=True)
plt.show()
# Text report showing the rules of a decision tree -
print(tree.export_text(best_model,feature_names=feature_names,show_weights=True))
|--- lead_time <= 150.50
|   |--- no_of_special_requests <= 0.50
|   |   |--- market_segment_type_Online <= 0.50
|   |   |   |--- weights: [2504.00, 245.00] class: 0
|   |   |--- market_segment_type_Online > 0.50
|   |   |   |--- weights: [2202.00, 2313.00] class: 1
|   |--- no_of_special_requests > 0.50
|   |   |--- weights: [7517.00, 1046.00] class: 0
|--- lead_time > 150.50
|   |--- weights: [698.00, 1650.00] class: 1
# importance of features in the tree building ( The importance of a feature is computed as the
#(normalized) total reduction of the 'criterion' brought by that feature. It is also known as the Gini importance )
print (pd.DataFrame(best_model.feature_importances_, columns = ["Imp"], index = X_train.columns).sort_values(by = 'Imp', ascending = False))
                                           Imp
lead_time                             0.473077
market_segment_type_Online            0.313749
no_of_special_requests                0.213174
type_of_meal_plan_Not Selected        0.000000
market_segment_type_Offline           0.000000
market_segment_type_Corporate         0.000000
market_segment_type_Complementary     0.000000
market_segment_type_Aviation          0.000000
room_type_reserved_Room_Type 7        0.000000
room_type_reserved_Room_Type 6        0.000000
room_type_reserved_Room_Type 5        0.000000
room_type_reserved_Room_Type 4        0.000000
room_type_reserved_Room_Type 3        0.000000
room_type_reserved_Room_Type 2        0.000000
room_type_reserved_Room_Type 1        0.000000
no_of_adults                          0.000000
no_of_children                        0.000000
type_of_meal_plan_Meal Plan 2         0.000000
type_of_meal_plan_Meal Plan 1         0.000000
date                                  0.000000
avg_price_per_room                    0.000000
no_of_previous_bookings_not_canceled  0.000000
no_of_previous_cancellations          0.000000
repeated_guest                        0.000000
required_car_parking_space            0.000000
no_of_week_nights                     0.000000
no_of_weekend_nights                  0.000000
type_of_meal_plan_Meal Plan 3         0.000000
importances = best_model.feature_importances_
indices = np.argsort(importances)
plt.figure(figsize=(12,12))
plt.title('Feature Importances')
plt.barh(range(len(indices)), importances[indices], color='violet', align='center')
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel('Relative Importance')
plt.show()
comparison_frame = pd.DataFrame({'Model':['Initial decision tree model','Decision tree with restricted maximum depth','Decision tree with hyperparameter tuning',
'Decision tree with post-pruning'], 'Train_Recall':[1,0.53,0.51,0.63], 'Test_Recall':[0.46,0.46,0.46,0.56]})
comparison_frame
| | Model | Train_Recall | Test_Recall |
|---|---|---|---|
| 0 | Initial decision tree model | 1.00 | 0.46 |
| 1 | Decision tree with restricted maximum depth | 0.53 | 0.46 |
| 2 | Decision tree with hyperparameter tuning | 0.51 | 0.46 |
| 3 | Decision tree with post-pruning | 0.63 | 0.56 |
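A hedged alternative to typing the scores by hand is to build the comparison from the recall values that the cells above print. The sketch below uses only figures printed earlier in this section (the grid-searched tree's performance tables, the `get_recall_score` output for the depth-3 tree, and the same output for the post-pruned tree). The dictionary layout, the row labels, and the helper names are illustrative assumptions.

```python
# (train recall, test recall) as printed earlier in this section.
printed_recall = {
    "Grid-searched tree (min_impurity_decrease=1e-05)": (0.978683, 0.681797),
    "Decision tree with restricted maximum depth": (0.724971, 0.723521),
    "Decision tree with post-pruning": (0.754282, 0.752407),
}


def comparison_rows(results):
    """One row per model, rounded to two decimals for display."""
    return [
        {"Model": name, "Train_Recall": round(train, 2), "Test_Recall": round(test, 2)}
        for name, (train, test) in results.items()
    ]


def best_by_test_recall(results):
    """Name of the model with the highest test recall."""
    return max(results, key=lambda name: results[name][1])


rows = comparison_rows(printed_recall)
for row in rows:
    print(row)
print("Best by test recall:", best_by_test_recall(printed_recall))
```

Passing `rows` to `pd.DataFrame` would render them in the same tabular form as `comparison_frame`.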
Because we prioritize test recall, the best tree for this dataset is the decision tree with post-pruning.
The three most important features for predicting a cancellation are lead time, the online market segment, and the number of special requests.
Online booking services are convenient, but additional requirements could be attached to these bookings. A similar late-cancellation penalty could also apply to them through the terms and conditions.
Consider tracking the number of special requests per booking; in the trees above, bookings with no special requests land in the leaves with more cancellations. A booking could then be flagged as unreliable, or additional fees could be charged for certain requests.